
Apache Spark™ - Lightning-Fast Cluster Computing


Spark Tutorial (Part I): Setting Up Spark and IPython Notebook within 10 minutes Introduction: The objective of this post is to share a step-by-step procedure for setting up a local data science environment consisting of IPython Notebook (Anaconda Analytics), with the ability to scale up by parallelizing/distributing tasks through Apache Spark on the local machine or a remote cluster. Anaconda Analytics is one of the most popular Python IDEs in the Python data scientist community, featuring the interactivity of the web-based IPython Notebook (gallery), ease of setup, and a comprehensive collection of built-in Python modules. Apache Spark, on the other hand, is described as lightning-fast cluster computing and a complementary piece to Apache Hadoop.

The Hadoop ecosystem: the (welcome) elephant in the room (infographic) To say Hadoop has become really big business would be to understate the case. At a broad level, it is the focal point of an immense big data movement, but Hadoop itself is now a software and services market of its own. In this graphic, we aim to map out the current ecosystem of Hadoop software and services (application and infrastructure software, as well as open source projects) and where those products fall in terms of use cases and delivery model. Click on a company name for more information about how they are using this technology. A couple of points about the methodology: these are products and projects built with Hadoop in mind that aim either to extend its utility in some way or to expose its core functions in a new manner. This is the second installment of our four-part series on the past, present and future of Hadoop.
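In practice, the setup a tutorial like this describes usually comes down to a few environment variables that point IPython at a local Spark installation. A minimal sketch in Python (the `/opt/spark` path is an assumption to adapt to your machine, and exact variable names can vary between Spark versions):

```python
import os
import sys

# Hypothetical install location -- adjust to wherever Spark was unpacked.
os.environ["SPARK_HOME"] = "/opt/spark"

# Ask PySpark to launch IPython as its driver shell instead of plain python.
os.environ["PYSPARK_DRIVER_PYTHON"] = "ipython"

# Make the bundled pyspark package importable from a regular interpreter.
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
```

With these in place, starting `pyspark` from a terminal should drop you into an IPython session with a ready-made Spark context.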

Announcing Spark Packages Today, we are happy to announce Spark Packages, a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher-level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.
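Packages in this index are addressed by Maven-style `group:artifact:version` coordinates, which is also the form passed to Spark's `--packages` flag. A small sketch of splitting such a coordinate (the `databricks:spark-csv:1.2.0` coordinate below is used purely as an illustration):

```python
from typing import NamedTuple

class Coordinate(NamedTuple):
    group: str
    artifact: str
    version: str

def parse_coordinate(coord: str) -> Coordinate:
    """Split a 'group:artifact:version' package coordinate into parts."""
    parts = coord.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected group:artifact:version, got {coord!r}")
    return Coordinate(*parts)

c = parse_coordinate("databricks:spark-csv:1.2.0")
```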

An Introduction to the ELK stack By combining the massively popular Elasticsearch, Logstash and Kibana, Elasticsearch Inc. has created an end-to-end stack that delivers actionable insights in real time from almost any type of structured and unstructured data source. Built and supported by the engineers behind each of these open source products, the Elasticsearch ELK stack makes searching and analyzing data easier than ever before. Thousands of organizations worldwide use these products for an endless variety of business-critical functions. And we'd like to show you how the ELK stack will make your life better, too.

Install, Setup, and Test Spark and Cassandra on Mac OS X This Gist assumes you already followed the instructions to install Cassandra, created a keyspace and table, and added some data. Install Apache Spark.

How Hadoop Works? HDFS case study The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
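The "simple programming model" referred to here is MapReduce. A toy, single-process word count in plain Python (no Hadoop required) that mirrors the map, shuffle, and reduce phases the framework distributes across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark and hadoop", "hadoop scales"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))  # {'hadoop': 2, ...}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network; the logic per record is exactly this simple.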

SparkR by amplab-extras SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. NOTE: As of April 2015, SparkR has been officially merged into Apache Spark and will ship in an upcoming release (1.4) due early summer 2015. You can contribute to and follow SparkR development on the Apache Spark mailing lists and issue tracker. NOTE: The upcoming Spark release (1.4) will not have the same API as described here; initial support for Spark in R will be focused on high-level operations instead of low-level ETL.

Configuring IPython Notebook Support for PySpark · John Ramey 01 Feb 2015 Apache Spark is a great way to perform large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python.

Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop Pattern mining can help analysts discover hidden structures in data. It has many applications, from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay's potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement.
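The supermarket example boils down to counting how often pairs of items co-occur in the same basket (the "support" of the pair). A toy sketch of pairwise support counting in plain Python, with made-up transactions; real pattern miners apply the same counting at scale with pruning:

```python
from itertools import combinations
from collections import Counter

def pair_supports(baskets):
    """Count, for every unordered item pair, how many baskets contain both."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Made-up transactions, purely for illustration.
baskets = [
    {"chips", "beer", "salsa"},
    {"chips", "beer"},
    {"salsa", "bread"},
]
supports = pair_supports(baskets)  # supports[("beer", "chips")] -> 2
```

Dividing a pair's support by the number of baskets containing the first item gives the confidence of rules like "chips ⇒ beer".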

SPARK Plugin for Eclipse: Installation Instructions and User Guide Installation of the plugin is fairly simple, but it will require you to download and set up the Eclipse (version 3.0) program. (If you have already installed Eclipse 3.0, skip ahead to Install the SPARK-IDE plugin; 2.x versions of Eclipse will not work with this plugin.) If you do not already have Eclipse, please download it from the Eclipse website. At the time of these instructions, the most recent release is Eclipse 3.0.1.

Smart Data Access with HADOOP HIVE “SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.” Reference: Section 2.4.2

How to run SparkR in Eclipse on Windows environment SparkR was officially merged into Apache Spark 1.4.0 on June 11, 2015. I could not find any direct post or link on Google that shows how to run SparkR in Eclipse on a Windows environment. There are a couple of posts showing how to run SparkR from the command line on Windows, and a few on running SparkR in RStudio. I have set up a few Windows machines locally to run SparkR in both Eclipse and RStudio.