
EC2


HadoopDebug < Storage < TWiki. This TWiki page is set up to help diagnose and solve some typical Hadoop issues. It is often useful to run the FUSE mount in debug mode to identify specific errors and problems with the mount: /usr/bin/hdfs -o server=namenode.fqdn,port=9000,rdbuffer=131072,allow_other -d /mnt/hadoop/ Note the use of the -d switch, which puts the mount in debug mode.
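The mount invocation above can be wrapped in a small script; a minimal sketch using the namenode host and mount point from the example (both are placeholders — substitute your own). Since mounting requires a live namenode, the sketch only assembles and prints the command:

```shell
#!/bin/sh
# Hypothetical values taken from the example above; substitute your own.
NAMENODE=namenode.fqdn
MOUNTPOINT=/mnt/hadoop

# Build the FUSE mount command; the -d switch runs the mount in the
# foreground in debug mode so errors are printed to the terminal.
CMD="/usr/bin/hdfs -o server=${NAMENODE},port=9000,rdbuffer=131072,allow_other -d ${MOUNTPOINT}/"

# Print the command rather than executing it, since running it needs a
# reachable namenode and FUSE installed.
echo "$CMD"
```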

Press CTRL-C to quit the mount after testing.

Sometimes diagnosing GridFTP server or Hadoop errors requires running the standalone GridFTP server, which runs in a debug mode. On the server: gridftp-hdfs-standalone This starts a new server in the terminal's foreground; its output should also include Hadoop-level errors. On the client (note the use of port 5002): source $VDT_LOCATION/setup.sh dd if=/dev/zero of=testfile.zero count=10000 bs=1024 globus-url-copy

gridftp-hdfs buffers log messages before writing them to the log file. To disable buffering, set: log_module stdio:buffer=0

Hadoop 0.20.205 Patches - Amazon Elastic MapReduce.

Using Cloudera's Hadoop AMIs to process EBS datasets on EC2. A while back, we noticed a blog post from Arun Jacob over at Evri (if you haven't seen Evri before, it's a pretty impressive take on search UI). We were particularly interested in helping Arun and others use EC2 and Hadoop to process data stored on EBS, as Amazon makes many public data sets available. After getting started, Arun volunteered to write up his experience, and we're happy to share it on the Cloudera blog.
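The dd invocation in the client-side test above creates the zero-filled file that globus-url-copy then transfers. A minimal sketch of just the local file-creation step (the transfer itself needs a running gridftp-hdfs server, so it is omitted here):

```shell
#!/bin/sh
# Create a test file of 10000 blocks of 1024 bytes each (~10 MB of zeros),
# exactly as in the client-side test above.
dd if=/dev/zero of=testfile.zero count=10000 bs=1024 2>/dev/null

# Sanity-check the size: 10000 * 1024 = 10240000 bytes.
SIZE=$(wc -c < testfile.zero)
echo "$SIZE"
```

A file of zeros compresses and transfers predictably, which makes it a convenient probe for throughput and buffer-size problems.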

-Christophe

Background: A couple of weeks ago I managed to get a Hadoop cluster up and running on EC2 using the /src/contrib/ec2 scripts found in the 0.18.3 version of Hadoop. This experience was not entirely pain-free, and in order to spin up clusters without a lot of hand modifications, I was going to have to modify those scripts to work with an explicit AMI to get around some of the issues I had run into.

Prerequisites: Prior to running any of the scripts/code below, I did the following: signed up for Amazon S3 and EC2 (see the EC2 Getting Started Guide). That's it. ...

Lineland: Hadoop on EC2 - A Primer. As more and more companies discover the power of Hadoop and how it solves complex analytical problems, there seems to be growing interest in quickly prototyping new solutions - possibly on short-lived or "throw away" cluster setups. Amazon's EC2 provides an ideal platform for such prototyping, and there are a lot of great resources on how this can be done.

I would like to mention "Tracking Trends with Hadoop and Hive on EC2" on the Cloudera Blog by Pete Skomoroch and "Running Hadoop MapReduce on Amazon EC2 and Amazon S3" by Tom White. They give you full examples of how to process data stored on S3 using EC2 servers. Overall, there seems to be a common need to quickly get insight into what a Hadoop- and Hive-based cluster can add in terms of business value.

Starting a Cluster: Let's jump in head first and solve the problem of actually launching a cluster. $ ec2-describe-images -a | grep hadoop | wc -l That gets daunting very quickly.
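Running ec2-describe-images for real requires AWS credentials, but the filtering step can be illustrated on a small sample listing. The AMI IDs and manifest names below are made up purely for illustration; only the grep/wc pipeline matches the command above:

```shell
#!/bin/sh
# Simulate a fragment of ec2-describe-images output (hypothetical AMI IDs),
# then apply the same pipeline to count lines mentioning hadoop.
printf 'IMAGE\tami-1111\thadoop-images/hadoop-0.18.3.manifest.xml\nIMAGE\tami-2222\tfedora-base.manifest.xml\nIMAGE\tami-3333\tcloudera-hadoop-0.17.manifest.xml\n' \
  | grep hadoop | wc -l
```

Two of the three sample lines mention hadoop, so the pipeline counts 2; against the full public registry the count runs into the dozens, which is why picking an AMI by eye gets daunting.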

EC2APITools. Amazon Elastic Compute Cloud. Apache Hadoop on Amazon EC2 | Grid Designer's Blog. AmazonEC2. Amazon EC2 (Elastic Compute Cloud) is a computing service. One allocates a set of hosts and runs one's application on them; when done, one de-allocates the hosts. Billing is hourly per host. Thus EC2 lets one deploy Hadoop on a cluster without having to own and operate that cluster, instead renting it on an hourly basis.

If you run Hadoop on EC2, you might consider using AmazonS3 for accessing job data (data transfer between S3 and EC2 instances is free). This document assumes that you have already followed the steps in Amazon's Getting Started Guide. Note that the older, manual step-by-step guide to getting Hadoop running on EC2 can be found here. Version 0.17 of Hadoop includes a few changes that provide support for multiple simultaneous clusters and quicker startup times for large clusters, and includes a pre-configured Ganglia installation. Contents: Preliminaries; Concepts (Amazon Machine Image (AMI), or image); Conventions; Security; Setting up; Running a job on a cluster; Troubleshooting.
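In the /src/contrib/ec2 scripts mentioned earlier, "Setting up" boils down to filling in an environment file with your AWS credentials before launching a cluster. A sketch of the kind of settings involved (the variable names follow the hadoop-ec2-env.sh file shipped with Hadoop-0.17-era scripts, but check the file in your own version; all values here are placeholders):

```shell
#!/bin/sh
# Sketch of hadoop-ec2-env.sh settings (placeholder values - substitute your own).
AWS_ACCOUNT_ID=123456789012        # your Amazon account ID, without dashes
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
KEY_NAME=gsg-keypair               # EC2 keypair used to log in to instances
HADOOP_VERSION=0.17.0              # Hadoop version installed on the AMI
S3_BUCKET=hadoop-ec2-images        # S3 bucket holding the AMI

echo "cluster will launch Hadoop ${HADOOP_VERSION} with keypair ${KEY_NAME}"
```

With this file filled in, the launch scripts can start a cluster without the per-run hand modifications the Background section complains about.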