
Mikeaddison93/spark-csv: CSV data source for Spark SQL and DataFrames
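A minimal sketch of how a spark-csv style data source is typically used from the Spark 1.x shell. The file name cars.csv and the header/inferSchema options are illustrative assumptions, not part of the linked repository; the format string follows the upstream databricks/spark-csv package that this repository mirrors.

```scala
// Sketch: load a CSV file into a DataFrame with the spark-csv data source (Spark 1.4+ shell,
// where sc is the pre-created SparkContext). "cars.csv" is an assumed example file.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // treat the first line of the file as a header
  .option("inferSchema", "true")   // infer column types instead of reading everything as strings
  .load("cars.csv")
df.printSchema()
```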

Spark SQL and DataFrames - Spark 1.6.0 Documentation. Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell. One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. DataFrames: a DataFrame is a distributed collection of data organized into named columns. The DataFrame API is available in Scala, Java, Python, and R. Datasets: a Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. Starting Point: SQLContext. Creating DataFrames.
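A brief sketch of the starting point described above, using the Spark 1.6 spark-shell; the people.json path is the sample file shipped with the Spark distribution, as the documentation notes.

```scala
// Spark 1.6: SQLContext is the entry point for DataFrame functionality in spark-shell,
// where an existing SparkContext is already available as sc.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create a DataFrame from a sample JSON file included in the Spark distribution.
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()         // display the rows
df.printSchema()  // display the inferred schema
```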

SparkR by amplab-extras: SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. NOTE: As of April 2015, SparkR has been officially merged into Apache Spark and will ship in an upcoming release (1.4) due early summer 2015. You can contribute and follow SparkR development on the Apache Spark mailing lists and issue tracker. NOTE: The upcoming Spark release (1.4) will not have the same API as described here; initial support for Spark in R will be focused on high-level operations instead of low-level ETL. Features: SparkR exposes the RDD API of Spark as distributed lists in R, for example:
sc <- sparkR.init("local")
lines <- textFile(sc, "
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. SparkR automatically serializes the variables necessary to execute a function on the cluster and also allows easy use of existing R packages inside closures. Installing SparkR. Running sparkR.

Spark SQL DataFrames: A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database, but with good optimization techniques under the hood. A DataFrame can be constructed from an array of different sources such as Hive tables, structured data files, external databases, or existing RDDs. This API was designed for modern big data and data science applications, taking inspiration from the DataFrame in R and pandas in Python. Features of DataFrame: here is a set of a few characteristic features of a DataFrame. SQLContext: SQLContext is a class used for initializing the functionality of Spark SQL. The following command is used to start the spark-shell, which initializes the SparkContext: $ spark-shell. By default, the SparkContext object is initialized with the name sc when the spark-shell starts. Use the following command to create a SQLContext: scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc). Example: DataFrame operations, reading a JSON document, and its output.
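A short sketch of the flow this excerpt describes, run in the spark-shell; the file "employee.json" is assumed to contain the tutorial's employee records and is not reproduced here.

```scala
// Initialize a SQLContext from the pre-created SparkContext (sc), then read a JSON
// document into a DataFrame. "employee.json" is the tutorial's assumed input file.
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlcontext.read.json("employee.json")
dfs.show()          // display the employee records
dfs.printSchema()   // display the schema inferred from the JSON document
```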

How to run SparkR in Eclipse on a Windows environment | ztest46. SparkR was officially merged into Apache Spark 1.4.0, released on June 11, 2015. I could not find any direct post or link on Google that shows how to run SparkR in Eclipse on a Windows environment. There are a couple of posts showing how to run SparkR on the command line on Windows, and a few more on running SparkR in RStudio. I have set up a few Windows machines locally to run SparkR in Eclipse and in RStudio. The basic steps are:
1) Install Eclipse if you have not done so.
2) Download a pre-built Spark 1.4.0 package, such as "Pre-built for Hadoop 2.4 or later".
3) Install R.
4) Verify that a JDK (version 1.5 or later) is installed on your Windows PC.
5) Download Hadoop's native Windows component from the SrcCodes site (section hadoop-common-2.2.0/bin): download the hadoop-common-2.2.0-bin-master.zip file, unzip it, and copy the bin folder to a folder such as C:\_SparkJar\hadoop.
6) Start Eclipse and set up the JDK as the default Java runtime. Click OK, then the Apply button.
> head(SparkDf)
Jun Wu

Inferring the Schema using Reflection: This method uses reflection to generate the schema of an RDD that contains specific types of objects. The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. Example: consider employee records in a text file named employee.txt. Given data: the file employee.txt, placed in the directory where the spark-shell is running, contains the following records:
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
The following example explains how to generate a schema using reflection. Start the Spark shell using the following command: $ spark-shell. Create a SQLContext, import the SQL functions, and inspect the output.
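A sketch of the reflection-based approach in the spark-shell, under the assumption that employee.txt contains the comma-separated records listed above.

```scala
// The case class argument names become the DataFrame column names.
case class Employee(id: Int, name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // enables .toDF() on RDDs of case classes

// Read the text file, split each line on commas, and map the fields to Employee objects.
val empl = sc.textFile("employee.txt")
  .map(_.split(","))
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary table and query it with SQL.
empl.registerTempTable("employee")
sqlContext.sql("SELECT * FROM employee WHERE age <= 35").show()
```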

R: The R Project for Statistical Computing

Programmatically Specifying the Schema: The second method for creating a DataFrame is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. We can create a DataFrame programmatically using the following three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
Example: consider employee records in a text file named employee.txt. Given data: the file employee.txt, placed in the directory where the spark-shell is running, contains the following records:
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
Follow the steps given below to generate a schema programmatically. Open the Spark shell using the following command: $ spark-shell. Then create a SQLContext object using the following command.
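A sketch of the three programmatic steps in the spark-shell, again assuming employee.txt holds the comma-separated records shown above (all fields are kept as strings here for simplicity).

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Step 1: create an RDD of Rows from the original text-file RDD.
val employee = sc.textFile("employee.txt")
val rowRDD = employee.map(_.split(",")).map(e => Row(e(0).trim, e(1).trim, e(2).trim))

// Step 2: build a StructType schema matching the structure of the Rows.
val schemaString = "id name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Step 3: apply the schema to the RDD of Rows via createDataFrame.
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
employeeDF.registerTempTable("employee")
sqlContext.sql("SELECT id, name FROM employee").show()
```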

Spark SQL Hive Tables: Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext. Using HiveContext, you can create and find tables in the Hive metastore and write queries on them using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext; when not configured by hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory. Consider the following example of employee records using Hive tables. employee.txt (place it in the current directory where the spark-shell is running):
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
Start the Spark shell first:
$ su
password:
# spark-shell
scala>
Create a HiveContext object: use the following command to initialize the HiveContext in the Spark shell: scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc). Then create a table using HiveQL and select fields from the table.
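A brief sketch of the HiveQL steps named above, run in the spark-shell; the table name and DDL are an assumed illustration based on the employee.txt records, not the linked page's exact statements.

```scala
// HiveContext inherits from SQLContext and adds HiveQL support.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Create a Hive table for the comma-separated employee records and load the file into it.
sqlContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")

// Select fields from the table with HiveQL.
sqlContext.sql("SELECT * FROM employee").show()
```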

Spark SQL JSON Datasets: Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String or a JSON file. Spark SQL provides an option for querying JSON data along with auto-capturing of JSON schemas for both reading and writing data. Example: consider employee records in a file named employee.json. Read a JSON document named employee.json with the following content and generate a table based on the schema in the JSON document. employee.json: place this file in the directory where the current scala> prompt is running. Let us perform some DataFrame operations on the given data. DataFrame operations: DataFrame provides a domain-specific language for structured data manipulation. Follow the steps given below to perform DataFrame operations. Read the JSON document first of all: scala> val dfs = sqlContext.read.json("employee.json"). Then use the printSchema method.
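Continuing from the dfs DataFrame read above, a sketch of the kinds of operations the tutorial walks through; the name and age column names assume the employee.json fields used in these excerpts.

```scala
// Inspect the schema Spark SQL inferred from the JSON document, then run a few
// DataFrame operations (assumed columns: name, age).
dfs.printSchema()                    // show the inferred schema
dfs.select("name").show()            // project a single column
dfs.filter(dfs("age") > 23).show()   // rows with age greater than 23
dfs.groupBy("age").count().show()    // count employees by age
```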

Spark SQL Parquet Files: Parquet is a columnar format supported by many data processing systems. The advantages of columnar storage are as follows: it limits IO operations; it can fetch only the specific columns that you need to access; it consumes less space; and it gives better-summarized data and follows type-specific encoding. Spark SQL provides support for both reading and writing Parquet files, automatically capturing the schema of the original data. Let's take another look at the same employee record data, this time stored as employee.parquet in the directory where the spark-shell is running. Given data: do not bother converting the input employee records into Parquet format by hand; the following commands read the JSON data and write it out as Parquet:
$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.write.parquet("employee.parquet")
The Parquet file itself cannot be shown here as plain text. Open the Spark shell: $ spark-shell
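A short sketch of the follow-on step this excerpt leads into: reading the Parquet output back into a DataFrame and querying it with SQL (table and file names follow the employee example above).

```scala
// Read the Parquet file written above back into a DataFrame; the schema travels with the data.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parqfile = sqlContext.read.parquet("employee.parquet")

// Register it as a temporary table and query it with SQL.
parqfile.registerTempTable("employee")
val allrecords = sqlContext.sql("SELECT * FROM employee")
allrecords.show()
```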

Spark SQL Quick Guide: Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. Here, the main concern is maintaining speed when processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program. Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process. Contrary to common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management; Hadoop is just one of the ways to implement Spark. Spark uses Hadoop in two ways: one is storage and the second is processing. Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. Evolution of Apache Spark. Features of Apache Spark: Apache Spark has the following features. Spark SQL.

Using Apache Spark DataFrames for Processing of Tabular Data: This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox. The new Spark DataFrames API is designed to make big data processing on tabular data easier. A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and it can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases. In this post, you'll learn how to load data into Spark DataFrames and explore the data with Spark SQL. This post assumes a basic understanding of Spark concepts. The tutorial runs on the MapR v5.0 Sandbox, which includes Spark 1.3. You can download the code and data to run these examples from here; the examples in this post can be run in the spark-shell after launching it with the spark-shell command. The sample data sets. How many auctions were held? Summary.
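A sketch of the workflow the post describes, using the Spark 1.3 reflection-based API; the file name ebay.csv and the Auction fields are assumptions for illustration, not necessarily the post's exact schema.

```scala
// Assumed auction schema for illustration only.
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String,
                   bidderrate: Int, openbid: Float, price: Float, item: String, daystolive: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Load the CSV data, map each line to an Auction, and convert to a DataFrame.
val ebay = sc.textFile("ebay.csv")
  .map(_.split(","))
  .map(p => Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p(4).toInt,
                    p(5).toFloat, p(6).toFloat, p(7), p(8).toInt))
  .toDF()

// How many auctions were held? Count the distinct auction ids.
ebay.select("auctionid").distinct.count()
```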

Spark SQL and DataFrames - Spark 1.3.0 Documentation. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. The DataFrame API is available in Scala, Java, and Python. All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell or the pyspark shell. Starting Point: SQLContext. The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. In Scala:
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
In Python:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Creating DataFrames. For example:
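Following the pattern of the 1.3.0 documentation, a brief example of creating and querying a DataFrame in the spark-shell; the people.json sample file ships with the Spark distribution, and Spark 1.3 uses sqlContext.jsonFile (later releases use sqlContext.read.json).

```scala
// Create a DataFrame from the sample JSON file bundled with Spark (1.3 API).
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Show the contents and the inferred schema.
df.show()
df.printSchema()

// Basic DataFrame operations: project a column and filter rows.
df.select("name").show()
df.filter(df("age") > 21).show()
```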
