
Hadoop


Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it's not without flaws.

Because Hadoop isn’t perfect: 8 ways to replace HDFS

Ironically, one of Hadoop's biggest shortcomings today is also one of its biggest strengths going forward: the Hadoop Distributed File System. Within the Apache Software Foundation, HDFS keeps improving in performance and availability, and it is probably fine for the majority of Hadoop workloads running in pilot projects, skunkworks efforts or other non-demanding environments. Technologies such as HBase that are built atop HDFS also speak to its versatility as a storage system, even for non-MapReduce applications. But if the growing number of options for replacing HDFS signifies anything, it's that HDFS isn't quite where it needs to be. One example is Cassandra (DataStax): not a file system at all but an open-source NoSQL key-value store, Cassandra has become a viable alternative to HDFS for web applications that rely on fast data access. Ceph is another.
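To make the "fast data access" contrast concrete, here is a minimal sketch of a key-based lookup against Cassandra using the DataStax Python driver; the keyspace, table and column names are invented for illustration, and a reachable cluster is assumed.

```python
# A minimal sketch, assuming a running Cassandra cluster and the DataStax
# cassandra-driver package; keyspace, table and column names are invented.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # contact point for the cluster
session = cluster.connect("web_app")    # hypothetical keyspace

# A key-based lookup like this returns in milliseconds; HDFS, a streaming
# file system, is not built to serve point lookups like this directly.
row = session.execute(
    "SELECT profile_json FROM user_profiles WHERE user_id = %s",
    ("user-42",),
).one()
print(row.profile_json if row else "not found")
```

HDFS, by contrast, is optimized for streaming reads and writes of large files rather than millisecond point lookups, which is exactly the gap these alternatives aim at.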

We get asked a lot of questions about how to select Apache Hadoop worker node hardware.

Best Practices for Selecting Apache Hadoop Hardware

During my time at Yahoo!, we bought a lot of nodes with 6 x 2TB SATA drives, 24GB RAM and 8 cores in a dual-socket configuration. This has proven to be a pretty good configuration. This year, I've seen systems with 12 x 2TB SATA drives, 48GB RAM and 8 cores in a dual-socket configuration, and we will see a move to 3TB drives this year. Which configuration makes sense for a given organization depends on ratios such as the storage-to-compute ratio of its workload, and on other factors that cannot be answered generically. The key for Hadoop clusters is to buy quality commodity equipment. On CPU: it helps to understand your workload, but for most systems I recommend sticking with medium clock speeds and no more than two sockets.
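As a rough illustration of how the storage-to-compute ratio drives cluster size, here is a back-of-the-envelope sizing sketch for the 6 x 2TB node profile mentioned above; the replication factor and the head-room reserved for intermediate data are assumed values, not figures from the post.

```python
# Back-of-the-envelope sizing sketch for a 6 x 2TB SATA worker node.
# Replication factor and temp-space head-room are assumptions.
import math

DRIVES_PER_NODE = 6
DRIVE_TB = 2.0
REPLICATION = 3        # HDFS default block replication (assumed)
TEMP_OVERHEAD = 0.25   # assumed fraction of disk kept free for shuffle/temp data

raw_tb_per_node = DRIVES_PER_NODE * DRIVE_TB
usable_tb_per_node = raw_tb_per_node * (1 - TEMP_OVERHEAD) / REPLICATION

def nodes_for(dataset_tb: float) -> int:
    """Worker nodes needed just to hold a dataset of the given size."""
    return math.ceil(dataset_tb / usable_tb_per_node)

print(f"usable user data per node: {usable_tb_per_node:.1f} TB")  # 3.0 TB
print(f"nodes for 1 PB of data:    {nodes_for(1000)}")            # 334
```

Doubling the drives per node roughly halves the node count for the same dataset, which is why the storage-heavy 12-drive configurations make sense for storage-bound workloads but not compute-bound ones.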

In the Big Data business, running fewer, larger clusters is cheaper than running more small clusters.

The Next Generation of Apache Hadoop MapReduce · Yahoo! Hadoop Blog

Larger clusters also process larger data sets and support more jobs and users. The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The current implementation of the Hadoop MapReduce framework is showing its age: given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address deficiencies in its scalability, memory consumption, threading model, reliability and performance.
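To make that factoring concrete, here is a toy sketch of the split the post describes: a generic, application-agnostic resource scheduler that only hands out containers, and a per-job application master that turns them into map and reduce tasks. All class and method names are hypothetical, not the real next-generation MapReduce (YARN) API.

```python
# Illustrative sketch of the scheduler / per-job-master split.
# Names are hypothetical; this is not the real MRv2/YARN API.
from dataclasses import dataclass

@dataclass
class Container:
    node: str
    memory_mb: int
    cores: int

class ResourceScheduler:
    """Cluster-wide and generic: it only hands out containers."""
    def __init__(self, nodes, mem_mb=48_000, cores=8):
        self.free = {n: [mem_mb, cores] for n in nodes}

    def allocate(self, memory_mb, cores):
        for node, (mem, cpu) in self.free.items():
            if mem >= memory_mb and cpu >= cores:
                self.free[node] = [mem - memory_mb, cpu - cores]
                return Container(node, memory_mb, cores)
        return None  # cluster out of capacity

class MapReduceAppMaster:
    """Per-job and user-level: knows about map and reduce tasks, not nodes."""
    def __init__(self, scheduler, n_maps, n_reduces):
        self.scheduler = scheduler
        self.tasks = [f"map-{i}" for i in range(n_maps)] + \
                     [f"reduce-{i}" for i in range(n_reduces)]

    def run(self):
        for task in self.tasks:
            container = self.scheduler.allocate(memory_mb=1024, cores=1)
            if container is None:
                raise RuntimeError("no capacity left for " + task)
            print(f"{task} -> {container.node}")

scheduler = ResourceScheduler(nodes=["node-1", "node-2"])
MapReduceAppMaster(scheduler, n_maps=4, n_reduces=1).run()
```

Because the scheduler knows nothing about MapReduce, the same cluster can host other kinds of per-job masters side by side, which is part of what enables multi-tenancy on the larger clusters the post targets.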

Pegasus is an award-winning, open-source graph-mining system with massive scalability.

PEGASUS: Peta-Scale Graph Mining System

It is designed to analyze petabytes of graph data with ease, and the project won the Silver Award in the Open Source Software World Challenge.

dumbo (GitHub). Disco. Cascading.