
Hadoop


MR2 and YARN Briefly Explained. From CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so often that many people are confused about them. Do they mean the same thing, or not? This post aims to clarify these two terms. What is YARN? YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution.

YARN’s execution model is more generic than the earlier MapReduce implementation. What is MR2? With the advent of YARN, there is no longer a single JobTracker to run jobs and per-node TaskTrackers to run their tasks. Summary. 10 MapReduce Tips. This piece is based on the talk “Practical MapReduce” that I gave at Hadoop User Group UK on April 14. 1. Use an appropriate MapReduce language. There are many languages and frameworks that sit on top of MapReduce, so it’s worth thinking up-front about which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
Java: Good for: speed; control; binary data; working with existing Java or MapReduce libraries.
Pipes: Good for: working with existing C++ libraries.
Streaming: Good for: writing MapReduce programs in scripting languages.
Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): Good for: Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction (see the sketch below).
Pig, Hive, Cascading: Good for: higher-level abstractions; joins; nested data.
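To give a flavor of the “quick results” these Python frameworks aim for, here is a minimal word-count sketch written against Dumbo’s mapper/reducer interface. The file name wordcount_dumbo.py is my own, and the function signatures and the dumbo.run() entry point follow Dumbo’s documented word-count example, so treat the details as illustrative rather than authoritative; compare it with the plain streaming WordCount further down.

#!/usr/bin/env python
# wordcount_dumbo.py -- minimal Dumbo-style word count (illustrative sketch)

def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every word on it
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over every count emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Dumbo then wires these two functions into a streaming job itself, launched with something along the lines of dumbo start wordcount_dumbo.py -hadoop $HADOOP_HOME -input <input dir> -output <output dir>.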

2. Are you generating large, unbounded files, like log files? Answers to these questions determine how you store and process data using HDFS. 3. Splittable. Hadoop Streaming. Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

How Does Streaming Work. In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized.

This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer. You can supply a Java class as the mapper and/or the reducer.
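To make the line-oriented protocol concrete: by default, streaming treats everything up to the first tab character of an output line as the key and the rest of the line as the value, and it is the key that the framework sorts and groups on before the reduce phase. The following mapper sketch is purely illustrative (the file name fields.py and the choice of keying on the first whitespace-delimited field are my own assumptions, not part of the page quoted above):

#!/usr/bin/env python
# fields.py -- illustrative streaming mapper: key on the first field of each line
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fields = line.split()
    # everything before the first tab of the emitted line becomes the key,
    # the remainder becomes the value
    print("%s\t%s" % (fields[0], " ".join(fields[1:])))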

Python

MapReduce. HBase. Writing An Hadoop MapReduce Program In Python. In this tutorial I will describe how to write a simple MapReduce program for Hadoop in the Python programming language. Even though the Hadoop framework is written in Java, programs for Hadoop need not be written in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1).

However, Hadoop’s documentation and the most prominent Python example on the Hadoop website could make you think that you must translate your Python code into a Java jar file using Jython. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue with the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop – just have a look at the example in $HADOOP_HOME/src/examples/python/WordCount.py and you will see what I mean.

Our program will mimic the WordCount example, i.e. it reads text files and counts how often words occur. Map step: mapper.py (a sketch follows below). Hadoop Streaming Made Simple using Joins and Keys with Python | All Things Hadoop. There are a lot of different ways to write MapReduce jobs!!! Sample code for this post. I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or am creating new ones) and enjoy the lifecycle where the initial elaboration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs, as is often the case). When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and the more recently released mrjob.
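The tutorial's code listings did not survive in this excerpt, so here is a minimal sketch of the kind of mapper.py and reducer.py it describes; treat it as illustrative rather than as the tutorial's exact code. The mapper emits word<TAB>1 pairs, and the reducer sums the counts per word, relying on streaming delivering the reducer's input sorted by key.

#!/usr/bin/env python
# mapper.py -- illustrative word-count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit "word<TAB>1"; the part before the tab is the key
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# reducer.py -- illustrative word-count reducer for Hadoop Streaming
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # keys arrive sorted, so a new key means the previous word is finished
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A quick local sanity check is to pipe a text file through the scripts (cat input.txt | ./mapper.py | sort | ./reducer.py); a cluster run then uses the same streaming jar shown earlier, roughly $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input <input dir> -output <output dir>.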

I like working under the hood myself and getting down and dirty with the data, and here is how you can too. Let's start first by defining two simple sample data sets.
Data set 1: countries.dat (name|key)
Data set 2: customers.dat (name|type|country)
To do this you need to:
1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results
Great!
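Here is one hedged way such a streaming join could look in Python. The file names, the source-tagging scheme, and the in-memory counting in the reducer are my own illustration, not the post's actual sample code (a more careful version would exploit the sorted keys rather than buffer everything in the reducer).

#!/usr/bin/env python
# join_mapper.py -- illustrative reduce-side join mapper for the two data sets above
import sys

for line in sys.stdin:
    fields = line.strip().split("|")
    if len(fields) == 2:
        # countries.dat: name|key -> emit (key, COUNTRY, name)
        name, key = fields
        print("%s\tCOUNTRY\t%s" % (key, name))
    elif len(fields) == 3:
        # customers.dat: name|type|country -> emit (country, CUSTOMER, type)
        name, ctype, country = fields
        print("%s\tCUSTOMER\t%s" % (country, ctype))

#!/usr/bin/env python
# join_reducer.py -- counts the type of customer per country (illustrative)
import sys
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # country key -> customer type -> count
names = {}                                      # country key -> country name

for line in sys.stdin:
    key, tag, value = line.rstrip("\n").split("\t", 2)
    if tag == "COUNTRY":
        names[key] = value
    else:
        counts[key][value] += 1

for key, per_type in counts.items():
    for ctype, n in per_type.items():
        print("%s\t%s\t%d" % (names.get(key, key), ctype, n))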

HadoopStreaming. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.

Usage: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]

Options:
-input <path>                 DFS input file(s) for the Map step
-output <path>                DFS output directory for the Reduce step
-mapper <cmd|JavaClassName>   The streaming command to run
-combiner <JavaClassName>     Combiner has to be a Java class
-reducer <cmd|JavaClassName>  The streaming command to run
-file <file>                  File/dir to be shipped in the Job jar file
-dfs <h:p>|local              Optional. Override DFS configuration
-jt <h:p>|local               Optional. Override JobTracker configuration
-additionalconfspec specfile  Optional.
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
-outputformat TextOutputFormat(default)|JavaClassName  Optional.
-partitioner JavaClassName    Optional.

Practical Help. See Also: HadoopStreaming/AlternativeInterfaces. CDH4 Installation Guide. Hadoop Streaming. CDH Downloads.

Tips and Guidelines. This section provides solutions to some performance problems, and describes configuration best practices. Important: If you are running CDH over 10 Gbps Ethernet, improperly set network configuration or improperly applied NIC firmware or drivers can noticeably degrade performance.

Work with your network engineers and hardware vendors to make sure that you have the proper NIC firmware, drivers, and configurations in place and that your network performs properly. Cloudera recognizes that network setup and upgrade are challenging problems, and will make best efforts to share any helpful experiences. Disabling Transparent Hugepage Compaction. Most Linux platforms supported by CDH4 include a feature called transparent hugepage compaction, which interacts poorly with Hadoop workloads and can seriously degrade performance.

Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". What to do:
# echo 'never' >
$ sudo sh -c "echo 'never' > "
Note. Hadoop: My Experience with Cloudera and MapR. A few months back we set out to build a new Hadoop cluster at Medialets. We have been live with Hadoop in production since April 2010 and we are still running CDH2. Our current hosting provider does not have a very ideal implementation for us: our 36 nodes are spread out across an entire data center and 5 networks, each with a 1 GB link. While there are issues with this type of setup, we have been able to organically grow our cluster (it started at 4 nodes), which powers 100% of our batch analytics for what is now hundreds of millions of mobile devices.

One of our MapReduce jobs processes 30+ billion objects (about 3 TB of uncompressed data) and takes about 90 minutes to run. This job runs all day long, back to back; each run ingests the data that was received while the previous run was executing. One of the primary goals of our new cluster was to reduce the time these types of jobs take without making any code changes or increasing our investment in hardware. HBase. Apache Hadoop: What are the pros and cons of using CDH instead of "raw" Apache Hadoop and its related products?