
Hadoop


PoweredBy - Hadoop Wiki. This page documents an alphabetical list of institutions that are using Hadoop for educational or production uses.


Companies that offer services on or based around Hadoop are listed in Distributions and Commercial Support. Please include details about your cluster hardware and size. Entries without this may be mistaken for spam references and deleted. To add entries you need write permission to the wiki, which you can get by subscribing to the common-dev@hadoop.apache.org mailing list and asking for permissions on the wiki account username you've registered yourself as. If you are using Apache Hadoop in production you ought to consider getting involved in the development process anyway, by filing bugs, testing beta releases, reviewing the code and turning your notes into shared documentation.

Which freaking Hadoop engine should I use? In 2015, Hadoop no longer means MapReduce on HDFS.


Instead, it refers to a whole ecosystem of technologies for working with “unstructured,” semi-structured, and structured data for complex processing at scale. This also now includes streaming use cases, which can be massively parallelized or happen in “real time” (which today means many different things ... other than traditional RTOS-style “real time”). The streaming Spark crowd now likes to contrast itself with the Hadoop -- or more specifically, the YARN -- crowd.

The Hadoop Ecosystem Table.

Apache Storm and Kafka Together: A Real-time Data Refinery. Hortonworks Data Platform’s YARN-based architecture enables multiple applications to share a common cluster and data set while ensuring consistent levels of response made possible by a centralized architecture.


Hortonworks led the efforts to on-board open source data processing engines, such as Apache Hive, HBase, Accumulo, Spark, Storm and others, on Apache Hadoop YARN. In this blog, we will focus on one of those data processing engines—Apache Storm—and its relationship with Apache Kafka. I will describe how Storm and Kafka form a multi-stage event processing pipeline, discuss some use cases, and explain Storm topologies.

An oil refinery takes crude oil, distills it, processes it and refines it into useful finished products such as the gas that we buy at the pump. We can think of Storm with Kafka as a similar refinery, but data is the input. Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data.

Netflix is open sourcing tools for analyzing data in Hadoop. The data team at Netflix is open sourcing some of the tools it uses to analyze data stored in Hadoop.


The overall open source project is called Surus, and it focuses on user-defined functions (or UDFs) that Netflix has built for Apache Hive and Pig, two higher-level frameworks that make it easier to query Hadoop data and write data-processing jobs. The first tool Netflix has released as part of Surus is a Pig function, called ScorePMML, for scoring predictive models at scale. Within Netflix, the goal was to standardize the process of taking a model that someone has built (using R, for example) and tested on a small dataset, then running it against a much larger dataset in Hadoop and possibly rolling it out as a production model.

According to the blog post introducing Surus and ScorePMML, future releases will include tools for tasks such as pattern recognition and outlier detection.

Hadoop needs a better front-end for business users. Whether you’re running it on premises or in the cloud, Hadoop leaves a lot to be desired in the ease-of-use department.


The Hadoop offerings on the three major cloud platforms (Amazon’s Elastic MapReduce, or EMR; Microsoft’s Azure HDInsight; and Google Compute Engine’s Click-to-Deploy Hadoop) have their warts. And the three major on-premises distributions (Cloudera CDH, Hortonworks HDP and MapR) can be formidable adversaries to casual users as well. The root of Hadoop’s ease-of-use problem, no matter where you run it, is that it’s essentially a command line tool. In the enterprise, people are used to graphical user interfaces (GUIs), be they in desktop applications or in the Web browser, that make things fairly simple to select, configure, and run.

Distributed SQL Query Engine for Big Data.

Spark Summit.

Hortonworks Develops, Distributes and Supports Apache Hadoop.

Welcome to Apache HCatalog!

Hadoop.

Revolution speeds stats on Hadoop clusters. Revolution Analytics, the company that is extending R, the open source statistical programming language, with proprietary extensions, is making available a free set of extensions that allow its R engine to run atop Hadoop clusters.


Oracle rolls its own NoSQL and Hadoop. OpenWorld: There's no shortage of ego at Oracle, as evidenced by the effusion of confidence behind the company's OpenWorld announcement of the not-so-humbly named Big Data Appliance.


Hadoop input format for swallowing entire files.

The Hadoop Tutorial Series « Java. Internet. Algorithms. Ideas. A progressive set of tutorials on the Apache Hadoop project, written along the way: Issue #1: Setting Up Your MapReduce Learning Playground; Issue #2: Getting Started With (Customized) Partitioning; Issue #3: Counters In Action; Issue #4: To Use Or Not To Use A Combiner. Your comments, critiques, and remarks are more than welcome.

Hadoop’s civil war: Does it matter who contributes most? (Cloud Computing News)