
Apache Spark™ - Lightning-Fast Cluster Computing


The Hadoop ecosystem: the (welcome) elephant in the room (infographic)
To say Hadoop has become really big business would be to understate the case. At a broad level, it's the focal point of an immense big data movement, but Hadoop itself is now a software and services market of its very own. In this graphic, we aim to map out the current ecosystem of Hadoop software and services (application and infrastructure software, as well as open source projects) and where those products fall in terms of use cases and delivery model. Click on a company name for more information about how they are using this technology. A couple of points about the methodology might be valuable: the first is that these are products and projects that are built with Hadoop in mind and that aim to either extend its utility in some way or expose its core functions in a new manner. This is the second installment of our four-part series on the past, present and future of Hadoop.

mikeaddison93/spark-avro

How Hadoop Works? HDFS case study
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. HDFS exposes a file system namespace and allows user data to be stored in files.
HDFS analysis: After analyzing Hadoop with JArchitect, here's the dependency graph of the hdfs project. To do its job, hdfs uses many third-party libraries such as guava, jetty, jackson and others. HDFS mostly uses the rt, hadoop-common and protobuf libraries.
Have more features; more performant; more secure.
I-DataNode Startup. How data is managed? NameNode. NameNodeRpcServer.

mikeaddison93/sparql-playground

Presto | Distributed SQL Query Engine for Big Data

python 2.7 - Why can't PySpark find py4j.java_gateway?

Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop
Pattern mining can help analysts discover hidden structures in data. Pattern mining has many applications, from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay's potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement.
A pattern mining algorithm: Frequent patterns are items that occur often in a data set. Instead of describing FP-growth in detail, we list the main steps from a practitioner's perspective.
1. Create transactions of items
2. Count occurrence of item sets
3. Sort item sets according to their occurrence
4. Remove infrequent items
5. Scan DB and build FP-tree
6. Recursively grow frequent item sets
Let's use an example to illustrate these steps.
Parallel pattern mining on the HP Vertica Analytics Platform
The real test: a billion records, and, of course, Hadoop
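The counting, sorting and pruning steps listed in the excerpt above can be sketched in a few lines of Python. This is only an illustration of that preprocessing stage, not the article's Vertica or Hadoop implementation; the function name `prepare_transactions`, the `min_support` threshold and the sample baskets are assumptions made up for the example, and building the FP-tree and growing item sets are left out.

```python
from collections import Counter


def prepare_transactions(transactions, min_support):
    """Count items, drop infrequent ones, and order each transaction.

    Hypothetical helper: covers only the preprocessing steps that come
    before the FP-tree is built.
    """
    # Count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))

    # Remove infrequent items (below the assumed support threshold).
    frequent = {item: c for item, c in counts.items() if c >= min_support}

    # Sort the surviving items of each transaction by descending frequency,
    # breaking ties alphabetically so the order is deterministic.
    ordered = []
    for t in transactions:
        kept = [item for item in set(t) if item in frequent]
        kept.sort(key=lambda item: (-frequent[item], item))
        ordered.append(kept)
    return frequent, ordered


# Made-up sample data in the spirit of the supermarket example.
baskets = [
    ["chips", "beer", "salsa"],
    ["chips", "beer"],
    ["chips", "milk"],
    ["beer", "bread"],
]
frequent, ordered = prepare_transactions(baskets, min_support=2)
```

Ordering every transaction by the same global frequency ranking is what lets transactions share prefixes once they are inserted into an FP-tree.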

Spark Tutorial (Part I): Setting Up Spark and IPython Notebook within 10 minutes | Yi Zhang
Introduction: The objective of this post is to share a step-by-step procedure for setting up a local data science environment consisting of IPython Notebook (Anaconda Analytics), with the ability to scale up by parallelizing/distributing tasks through Apache Spark on the local machine or a remote cluster. Anaconda Analytics is one of the most popular Python IDEs in the Python data science community, featuring the interactivity of the web-based IPython Notebook (gallery), ease of setup, and a comprehensive collection of built-in Python modules. On the other hand, Apache Spark is described as a lightning-fast cluster computing engine and a complementary piece to Apache Hadoop. For Python users, there are a number of advantages to using the web-based IPython Notebook for data science projects rather than the console-based ipython/pyspark.
Install Apache Spark and Anaconda (IPython Notebook) on the local machine
Optional Prerequisites:
Reference

Innovations in Apache Hadoop MapReduce Pig Hive for Improving Query...

Install, Setup, and Test Spark and Cassandra on Mac OS X
This Gist assumes you have already followed the instructions to install Cassandra, created a keyspace and table, and added some data.

Install Apache Spark:
brew install apache-spark

Get the Spark Cassandra Connector
Clone the download script from the GitHub Gist:
git clone
Rename the cloned directory:
mv b700fe70f0025a519171 connector
Run the script:
bash
Start the Spark Master and a Worker.

Testing the install
Make a note of the path to your connector directory. Open the Spark Shell with the connector:
spark-shell --driver-class-path $(echo path/to/connector/*.jar | sed 's/ /:/g')
Wait for everything to load.
scala >
You'll need to stop the default SparkContext, since you'll create your own with the script.
scala > sc.stop
Once that is finished, get ready to paste the script in:
scala > :paste
Make sure you are on a new line after 'table.count', then press Ctrl-D to exit paste mode.
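The `$(echo path/to/connector/*.jar | sed 's/ /:/g')` fragment in the spark-shell command expands every jar in the connector directory and joins the paths with colons, which is the classpath separator on Mac OS X. A rough Python equivalent is sketched below for readers who find the shell one-liner opaque; the helper names `join_classpath` and `build_classpath` are invented for this illustration and are not part of Spark or the connector.

```python
import glob
import os


def join_classpath(jar_paths):
    """Join jar paths with ':' as the sed substitution in the command does."""
    return ":".join(jar_paths)


def build_classpath(directory):
    """Collect every .jar file in a directory into one classpath string.

    Hypothetical helper; sorting mirrors the alphabetical order in which
    the shell expands the *.jar glob.
    """
    jars = sorted(glob.glob(os.path.join(directory, "*.jar")))
    return join_classpath(jars)


# Example call, using the placeholder path from the instructions above:
# classpath = build_classpath("path/to/connector")
```

Unlike the sed version, which replaces every space with a colon, this sketch keeps jar paths that contain spaces intact.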

OpenTSDB - A Distributed, Scalable Monitoring System

SPARK Plugin for Eclipse: Installation Instructions and User Guide
Installation of the plugin is fairly simple, but it requires you to download and set up Eclipse (version 3.0). (If you have already installed Eclipse 3.0, skip ahead to Install the SPARK-IDE plugin. 2.x versions of Eclipse will not work with this plugin.) If you do not already have Eclipse, please download it from At the time of these instructions, the most recent release is Eclipse 3.0.1, which can be downloaded from here: If you are a Java developer, then the "Eclipse SDK" release is probably best for you. The download is likely to take a long time. On the Windows platform there is no installer, so I assume the same is true for other platforms as well. If you have a 2.x version of Eclipse already installed and wish to continue using it, install Eclipse 3.0 in a separate directory.
NOTE: You must have an account on the AIC CVS server.
Open Window->Preferences, then select 'SPARK'.