Apache Spark™ - Lightning-Fast Cluster Computing


The Hadoop ecosystem: the (welcome) elephant in the room (infographic). To say Hadoop has become really big business would be to understate the case. At a broad level, it's the focal point of an immense big data movement, but Hadoop itself is now a software and services market of its very own. In this graphic, we aim to map out the current ecosystem of Hadoop software and services — application and infrastructure software, as well as open source projects — and where those products fall in terms of use cases and delivery model. A couple of points about the methodology might be valuable: the first is that these are products and projects that are built with Hadoop in mind and that aim either to extend its utility in some way or to expose its core functions in a new manner. This is the second installment of our four-part series on the past, present and future of Hadoop.

mikeaddison93/spark-avro · Druid | Interactive Analytics at Scale · Welcome to Apache Flume — Apache Flume · How Hadoop Works? HDFS case study

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. HDFS exposes a file system namespace and allows user data to be stored in files.

HDFS analysis: after analyzing Hadoop with JArchitect, here's the dependency graph of the hdfs project. To do its job, hdfs uses many third-party libraries such as guava, jetty, jackson and others. HDFS mostly uses the rt, hadoop-common and protobuf libraries. (Remaining section headings from the case study: I - DataNode startup; how data is managed; NameNode; NameNodeRpcServer.)
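The "simple programming models" the paragraph above refers to is MapReduce: each node maps over its local split of the data, the framework shuffles intermediate pairs by key, and reducers aggregate. As an illustration only (not Hadoop's actual Java API), here is a minimal single-process sketch of that model in Python, counting words across three imagined data splits:

```python
from collections import defaultdict

def map_phase(splits):
    """Map step: each split emits (word, 1) pairs independently."""
    for split in splits:
        for word in split.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework would do between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate each key's value list into a final count."""
    return {word: sum(values) for word, values in groups.items()}

# In a real cluster, each split would live on a different DataNode.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(splits)))
print(counts["the"])  # 3
```

Because map tasks touch only their own split and reduces only their own key group, any failed task can be rerun elsewhere — which is how the framework handles failures at the application layer rather than in hardware.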

mikeaddison93/sparql-playground · cidr11-bloom.pdf · Sharding & IDs at Instagram · Presto | Distributed SQL Query Engine for Big Data · python 2.7 - Why can't PySpark find py4j.java_gateway? · A plain English introduction to CAP theorem « Kaushik Sathupadi

You'll often hear about the CAP theorem, which specifies a kind of upper limit on what you can guarantee when designing distributed systems. As with most of my other introduction tutorials, let's try understanding CAP by comparing it with a real-world situation.

Chapter 1 — "Remembrance Inc", your new venture: last night, when your spouse appreciated you for remembering her birthday and bringing her a gift, a strange idea struck you: Remembrance Inc! So a typical phone conversation will look like this — Customer: "Hey, can you store my neighbor's birthday?"

Chapter 2 — You scale up: your venture gets funded by YCombinator. And there starts the problem. You start with a simple plan: you and your wife both get an extension phone; customers still dial (555)–55-REMEM and need to remember only one number; a PBX routes each customer's call to whichever of you is free, splitting the load equally.

Chapter 3 — You have your first "bad service": John: "Hey." You: "Glad you called Remembrance Inc!" … How did that happen? "Look," you tell her..
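The trade-off the analogy is building toward can be shown in code. Below is a toy two-replica store (names and structure are my own, purely illustrative): while the replicas can talk, every write goes to both; during a partition, a CP system refuses writes to stay consistent, whereas an AP system accepts them on one replica and lets the copies diverge.

```python
class Replica:
    """One copy of the data — one 'spouse with a notebook' in the analogy."""
    def __init__(self):
        self.data = {}

class TinyStore:
    """Toy two-replica store: mode 'cp' favors consistency, 'ap' favors availability."""
    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True = the replicas can no longer talk

    def write(self, key, value):
        if not self.partitioned:
            # Healthy network: replicate to both copies synchronously.
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "cp":
            # Consistent-under-partition: refuse the write (lose availability).
            return False
        # Available-under-partition: accept on one replica (risk stale reads).
        self.a.data[key] = value
        return True

cp = TinyStore("cp")
cp.partitioned = True
print(cp.write("birthday", "May 2"))  # False: the CP store turns the customer away
```

Run the same scenario with `TinyStore("ap")` and the write succeeds, but only replica `a` has it — exactly the "bad service" of Chapter 3, where one partner answers a call without the other's notes.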