
Mikeaddison93/spark-csv

https://github.com/mikeaddison93/spark-csv
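For reference, a minimal loading sketch, assuming this fork keeps the upstream databricks/spark-csv API (Spark 1.4-style reader); the file path and column names are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkCsvExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("spark-csv-example").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  // Read a headered CSV file into a DataFrame, letting the package infer column types
  val cars = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("cars.csv")

  cars.printSchema()
  cars.select("year", "model").show()
}

Instead of building the package into your jar, it can also be pulled in at launch time with spark-submit --packages.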

Related: Spark-Tools, Dataframe Rulebase Tools, Spark-SQL and Sqarql, File Processing how2s

Configuring IPython Notebook Support for PySpark · John Ramey, 01 Feb 2015. Apache Spark is a great way to perform large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python.

Build a CEP App on Apache Spark and Drools. Combining CDH with a business execution engine can serve as a solid foundation for complex event processing on big data. Event processing involves tracking and analyzing streams of data from events to support better insight and decision making. With the recent explosion in data volume and diversity of data sources, this goal can be quite challenging for architects to achieve. Complex event processing (CEP) is a type of event processing that combines data from multiple sources to identify patterns and complex relationships across various events. The value of CEP is that it helps identify opportunities and threats across many data sources and provides real-time alerts to act on them.

Complex Event Processing using Spark Streaming and SparkSQL. Introduction: Apache Spark has come a long way in just the last year. It now boasts the ability not only to process streams of data at scale, but to “query” that data at scale using SQL-like syntax. This ability makes Spark a viable alternative to established Complex Event Processing platforms and provides advantages over other open source stream processing systems. Especially with regard to the former, Spark now allows for the creation of “rules” that can run within stream “windows” of time and make decisions with the ease of SQL queries.
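A rough sketch of that idea, assuming a comma-separated event feed on a local socket; the Event fields, window sizes, and the "rule" query are invented for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(device: String, temperature: Double, ts: Long)

object StreamingRules extends App {
  val ssc = new StreamingContext(new SparkConf().setAppName("cep-rules").setMaster("local[2]"), Seconds(5))
  val sqlContext = new SQLContext(ssc.sparkContext)
  import sqlContext.implicits._

  // Parse "device,temperature,timestamp" lines arriving on a socket
  val events = ssc.socketTextStream("localhost", 9999)
    .map(_.split(","))
    .map(a => Event(a(0), a(1).toDouble, a(2).toLong))

  // Evaluate a "rule" over each 30-second window (sliding every 5 seconds) with plain SQL
  events.window(Seconds(30), Seconds(5)).foreachRDD { rdd =>
    rdd.toDF().registerTempTable("events")
    val avgTemps = sqlContext.sql("SELECT device, AVG(temperature) AS avgTemp FROM events GROUP BY device")
    avgTemps.filter("avgTemp > 80.0").collect().foreach(row => println(s"ALERT: $row"))
  }

  ssc.start()
  ssc.awaitTermination()
}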

Spark examples: how to work with CSV / TSV files (performing selection and projection operations). One of the simplest formats your files may have when you start playing with Spark is CSV (comma-separated values, or TSV for tab-separated). Let's see how to perform some operations over a set of these files. As usual, I suggest you create a Scala Maven project in Eclipse, compile a jar, and execute it on the cluster with the spark-submit command.
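A small sketch of the selection and projection idea with the core RDD API; the file path, column positions, and filter condition are assumptions for the example:

import org.apache.spark.{SparkConf, SparkContext}

object CsvSelectProject extends App {
  val sc = new SparkContext(new SparkConf().setAppName("csv-select-project").setMaster("local[*]"))

  // Split each CSV line into trimmed columns
  val rows = sc.textFile("hdfs:///data/people.csv")
    .map(_.split(",").map(_.trim))

  // Selection: keep only rows whose third column (age) is greater than 30
  val selected = rows.filter(cols => cols(2).toInt > 30)

  // Projection: keep only the first two columns (name, city)
  val projected = selected.map(cols => (cols(0), cols(1)))

  projected.take(10).foreach(println)
  sc.stop()
}

For TSV files the only change is splitting on "\t" instead of ",".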

Install, Setup, and Test Spark and Cassandra on Mac OS X. This Gist assumes you already followed the instructions to install Cassandra, created a keyspace and table, and added some data. Next, install Apache Spark.

Spark SQL for Real-Time Analytics. Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real-Time Analytics and for the next frontier, IoT (Internet of Things). By Sumit Pal and Ajit Jaokar (FutureText). This article is part of the forthcoming Data Science for Internet of Things Practitioner course in London. If you want to be a Data Scientist for the Internet of Things, this intensive course is ideal for you.
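For the Cassandra Gist above, a smoke-test sketch with the DataStax spark-cassandra-connector; the keyspace, table, and host are placeholders, and the connector jar is assumed to be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraSmokeTest extends App {
  val conf = new SparkConf()
    .setAppName("cassandra-smoke-test")
    .setMaster("local[*]")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Read the table back as an RDD of CassandraRow and print a few rows
  val rows = sc.cassandraTable("test_keyspace", "kv")
  rows.take(5).foreach(println)

  sc.stop()
}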

Run Spark and Spark SQL on Amazon Elastic MapReduce : Articles & Tutorials : Amazon Web Services. With the proliferation of data today, a common scenario is the need to store large data sets, process that data iteratively, and discover insights using low-latency relational queries. Using the Hadoop Distributed File System (HDFS) and Hadoop MapReduce components in Apache Hadoop, these workloads can be distributed over a cluster of computers. By distributing the data and processing over many computers, your results return quickly even over large datasets because multiple computers share the load required for processing. However, with Hadoop MapReduce, the speed and flexibility of querying that dataset is constrained by the time it takes for disk I/O operations and the two-step (map and reduce) batch processing framework. Apache Spark, an open-source cluster computing system optimized for speed, can provide much faster performance and flexibility than Hadoop MapReduce.
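As an illustration of the kind of low-latency relational query described above, a sketch that reads CSV data from S3 and aggregates it with Spark SQL; the bucket, record layout, and column meanings are invented:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Order(customer: String, amount: Double)

object EmrSqlQuery extends App {
  val sc = new SparkContext(new SparkConf().setAppName("emr-spark-sql"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Parse "customer,amount" records from CSV files in S3 into a DataFrame
  val orders = sc.textFile("s3://my-bucket/orders/*.csv")
    .map(_.split(","))
    .map(a => Order(a(0), a(1).toDouble))
    .toDF()
  orders.registerTempTable("orders")

  // A relational query over the whole dataset, answered without a MapReduce batch job
  sqlContext.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()
}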

A Foray into Spark and Scala. Apache Spark is a new wave in Big Data computing, an alternative to technologies such as Hadoop. I was recently watching someone analyze log files of image URL requests using shell scripts to create a MySQL database and thought it might be an interesting exercise to try it in Spark as well. Hadoop versus Spark: So what is Spark and how is it different from Hadoop? Hadoop is an older, more mature technology. It is really a collection of technologies such as a distributed resilient file system (HDFS), job tracking, and a parallelized map-reduce engine.

SparkR by amplab-extras. SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. NOTE: As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4) due early summer 2015. You can contribute and follow SparkR developments on the Apache Spark mailing lists and issue tracker. NOTE: The API in the upcoming Spark release (1.4) will not be the same as described here. Initial support for Spark in R will be focused on high-level operations instead of low-level ETL.
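For the log-analysis exercise in the "A Foray into Spark and Scala" entry above, a toy sketch that counts requests per image URL; the log layout (URL in the seventh whitespace-separated field, combined-log style) is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object ImageRequestCounts extends App {
  val sc = new SparkContext(new SparkConf().setAppName("image-request-counts").setMaster("local[*]"))

  val counts = sc.textFile("/var/log/httpd/access_log")
    .map(_.split("\\s+"))
    .filter(fields => fields.length > 6 && fields(6).matches("(?i).*\\.(png|jpe?g|gif)$"))
    .map(fields => (fields(6), 1))
    .reduceByKey(_ + _)               // requests per image URL
    .sortBy(_._2, ascending = false)  // most-requested first

  counts.take(20).foreach { case (url, n) => println(s"$n\t$url") }
  sc.stop()
}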

Spark Tutorial (Part I): Setting Up Spark and IPython Notebook within 10 minutes. Introduction: The objective of this post is to share a step-by-step procedure for setting up a local data science environment consisting of IPython Notebook (Anaconda Analytics) with the ability to scale up by parallelizing/distributing tasks through Apache Spark on the local machine or a remote cluster. Anaconda Analytics is one of the most popular Python IDEs in the Python data scientist community, featuring the interactivity of the web-based IPython Notebook (gallery), ease of setup, and a comprehensive collection of built-in Python modules. On the other hand, Apache Spark is described as lightning-fast cluster computing and a complementary piece to Apache Hadoop.

Deep Dive into Spark SQL’s Catalyst Optimizer. Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. We recently published a paper on Spark SQL that will appear in SIGMOD 2015 (co-authored with Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J.
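To peek at the Catalyst optimizer described in the Deep Dive entry above, a tiny sketch: explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan for a DataFrame query; the data and column names are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CatalystPeek extends App {
  val sc = new SparkContext(new SparkConf().setAppName("catalyst-peek").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val people = sc.parallelize(Seq(("alice", 34), ("bob", 28))).toDF("name", "age")

  // Prints the parsed, analyzed, and Catalyst-optimized logical plans and the physical plan
  people.filter($"age" > 30).select($"name").explain(true)
}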

Spark: Connecting to a jdbc data-source using dataframes. So far in Spark, JdbcRDD has been the way to connect to a relational data source. From Spark 1.4 onwards there is a built-in datasource for connecting to a jdbc source using dataframes. Spark introduced dataframes in version 1.3 and enriched the dataframe API in 1.4. RDDs are a unit of compute and storage in Spark but lack any information about the structure of the data, i.e. the schema. Dataframes combine RDDs with a schema, and this small addition makes them very powerful. You can read more about dataframes here.

12.5. How to Process a CSV File - Scala Cookbook [Book]. Combine Recipe 12.1 with Recipe 1.3. Given a simple CSV file like this named finance.csv:

January, 10000.00, 9000.00, 1000.00
February, 11000.00, 9500.00, 1500.00
March, 12000.00, 10000.00, 2000.00

you can process the lines in the file with the following code:

object CSVDemo extends App {
  println("Month, Income, Expenses, Profit")
  val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    // do whatever you want with the columns here
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
  bufferedSource.close
}

The magic in that code is this line: val cols = line.split(",").map(_.trim)
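A minimal read sketch of the jdbc datasource described in the first entry above (Spark 1.4+ reader API); the connection URL, table, credentials, and filter are placeholders, and the JDBC driver jar must be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JdbcDataFrameExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("jdbc-dataframe").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  // Load a database table as a DataFrame through the built-in jdbc datasource
  val employees = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/hr")
    .option("dbtable", "employees")
    .option("user", "spark")
    .option("password", "secret")
    .load()

  employees.printSchema()
  employees.filter("salary > 50000").show()
}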
