Announcing Spark Packages Today, we are happy to announce Spark Packages, a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.
An Introduction to the ELK stack By combining the massively popular Elasticsearch, Logstash and Kibana, Elasticsearch Inc has created an end-to-end stack that delivers actionable insights in real time from almost any type of structured and unstructured data source. Built and supported by the engineers behind each of these open source products, the Elasticsearch ELK stack makes searching and analyzing data easier than ever before. Thousands of organizations worldwide use these products for an endless variety of business critical functions. And we'd like to show you how the ELK stack will make your life better, too.
Install, Setup, and Test Spark and Cassandra on Mac OS X This Gist assumes you already followed the instructions to install Cassandra, created a keyspace and table, and added some data. Install Apache Spark
How Hadoop Works? HDFS case study The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
SparkR by amplab-extras SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. NOTE: As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4) due early summer 2015. You can contribute and follow SparkR developments on the Apache Spark mailing lists and issue tracker. NOTE: The API in the upcoming Spark release (1.4) will not be the same as the API described here. Initial support for Spark in R will be focussed on high level operations instead of low level ETL.
Configuring IPython Notebook Support for PySpark · John Ramey 01 Feb 2015 Apache Spark is a great way of performing large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python.
Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop Pattern mining can help analysts discover hidden structures in data. Pattern mining has many applications, from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay’s potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement.
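The chips-and-beer example above comes down to two standard association-rule measures: support (how often both items appear together) and confidence (how often the second item appears given the first). What follows is a minimal sketch in plain Python, not the method used in the HP Vertica and Hadoop comparison. The `baskets` data and the `pair_counts` and `rule_stats` helpers are hypothetical, invented only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented for illustration (not data from the article).
baskets = [
    {"chips", "beer", "salsa"},
    {"chips", "beer"},
    {"chips", "soda"},
    {"beer", "diapers"},
    {"chips", "beer", "diapers"},
]


def pair_counts(transactions):
    """Count transactions containing each item and each unordered item pair."""
    item_counter = Counter()
    pair_counter = Counter()
    for basket in transactions:
        item_counter.update(basket)
        pair_counter.update(
            frozenset(pair) for pair in combinations(sorted(basket), 2)
        )
    return item_counter, pair_counter


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    items, pairs = pair_counts(transactions)
    both = pairs[frozenset((antecedent, consequent))]
    support = both / len(transactions)
    # Confidence is undefined when the antecedent never occurs; report 0.0 then.
    confidence = both / items[antecedent] if items[antecedent] else 0.0
    return support, confidence


support, confidence = rule_stats(baskets, "chips", "beer")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

On a billion records the same counts would be computed by a distributed engine rather than in memory, but the definitions of support and confidence stay the same.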
SPARK Plugin for Eclipse: Installation Instructions and User Guide Installation of the plugin is fairly simple, but it will require you to download and set up Eclipse (version 3.0). (If you have already installed Eclipse 3.0, skip ahead to "Install the SPARK-IDE plugin". Eclipse 2.x versions will not work with this plugin.) If you do not already have Eclipse, please download it from www.eclipse.org. At the time of these instructions, the most recent release is Eclipse 3.0.1, which can be downloaded from here:
Smart Data Access with HADOOP HIVE “SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.” Reference: Section 2.4.2
How to run SparkR in Eclipse on Windows environment SparkR was officially merged into Apache Spark 1.4.0 on June 11, 2015. I could not find any direct post or link on Google that shows how to run SparkR in Eclipse in a Windows environment. There are a couple of posts showing how to run SparkR from the command line on Windows, and a few others on running SparkR in RStudio. I have set up a few Windows machines locally to run SparkR in Eclipse and in RStudio.