background preloader

Hadoop

Facebook Twitter

Crawler

Joey Mazzarelli » Blog Archive » Nutch and Hadoop (as user with NFS) Posted: July 25th, 2007 | Author: Joey | Filed under: Thesis | 7 Comments » This is from with some modifications to fit my needs. The article above assumes you have root access, which should be the case if you are going to consume the resources needed to crawl the internet. However, I want to run this as a normal user. Another gotcha is that I am working on a cluster that shares users’ home directories over NFS. For more details or explanation, refer to the original article. First, I need to be able to login to all the various nodes on the cluster through SSH without being prompted for a password. ssh-keygen -t rsa #Leave the password empty.cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys Next, get Nutch built. mkdir -p {~/src/nutch,~/opt/nutch/build}cd ~/src/nutch svn co . echo dist.dir=$HOME/opt/nutch/build > build.properties ant package cd ~/opt/nutch/cp -r build crawler cp -r build sandbox.

ACM Queue - Building Nutch: Open Source Search: A case study in writing an open source search engine. Python + Hadoop = Flying Circus Elephant. As a research intern here at Last.fm, dealing with huge datasets has become my daily bread. Having a herd of yellow elephants at my disposal makes this a lot easier, but the conventional way of writing Hadoop programs can be rather cumbersome. It generally involves lots of typing, compiling, building, and moving files around, which is especially annoying for the “write once, run never again” programs that are often required for research purposes. After fixing one too many stupid bugs caused by this copy/paste-encouraging work flow, I finally decided to look for an alternative.

The sluggishness of the conventional Hadoop work flow is mainly caused by the fact that Java is a very verbose and compiled programming language. Hence, finding a solution basically boils down to replacing Java. Def mapper(key,value): for word in value.split(): yield word,1 def reducer(key,values): yield key,sum(values) if __name__ == "__main__": import dumbo dumbo.run(mapper,reducer) Free Search. Nutch setup and use.