Hadoop

EC2

In computer science, MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

MinHash

http://en.wikipedia.org/wiki/MinHash
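
A minimal sketch of the idea in plain Java, not taken from the linked article: each set is reduced to a signature of per-hash-function minimum values, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the sets. The class name `MinHashSketch`, the linear hash family `(a*x + b) mod p`, and the use of `String.hashCode()` are all assumptions made for illustration; a linear family is only approximately min-wise independent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class; not part of Hadoop or LingPipe.
class MinHashSketch {
    // 2^31 - 1, a Mersenne prime used as the modulus of the hash family.
    private static final long PRIME = 2147483647L;
    private final long[] a;
    private final long[] b;

    public MinHashSketch(int numHashes, long seed) {
        if (numHashes <= 0) {
            throw new IllegalArgumentException("numHashes must be positive");
        }
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = rng.nextInt(Integer.MAX_VALUE);          // in [0, PRIME - 1]
        }
    }

    // For each hash function, keep the minimum hash value over the set.
    public long[] signature(Set<String> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String item : items) {
            long x = Math.floorMod((long) item.hashCode(), PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of signature positions that agree: the similarity estimate.
    public static double estimateSimilarity(long[] s1, long[] s2) {
        if (s1.length != s2.length) {
            throw new IllegalArgumentException("signatures differ in length");
        }
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("hadoop", "hive", "pig", "hdfs"));
        Set<String> s2 = new HashSet<>(Arrays.asList("hadoop", "hive", "hbase", "hdfs"));
        MinHashSketch mh = new MinHashSketch(256, 42L);
        double est = estimateSimilarity(mh.signature(s1), mh.signature(s2));
        // The exact Jaccard similarity of s1 and s2 is 3/5 = 0.6.
        System.out.println("estimated Jaccard similarity: " + est);
    }
}
```

More hash functions give a tighter estimate at the cost of a longer signature; the signature length is independent of the set size, which is what makes the comparison quick.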
http://alias-i.com/lingpipe/docs/api/com/aliasi/tokenizer/TokenNGramTokenizerFactory.html
Class com.aliasi.tokenizer.TokenNGramTokenizerFactory (extends java.lang.Object)

TokenNGramTokenizerFactory (LingPipe API)

Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (one minus the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance. http://lingpipe-blog.com/2011/01/12/scaling-jaccard-distance-deduplication-shingling-minhash-locality-sensitive-hashi/

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing « LingPipe Blog
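
As a rough illustration of the exact (unscaled) computation that post starts from, the sketch below shingles two texts into word n-grams and computes their Jaccard distance. It is a plain-Java stand-in and does not use the LingPipe API; the class name `ShingleJaccard`, the whitespace tokenization, and the lowercasing are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of LingPipe.
class ShingleJaccard {
    // Returns the set of word-level n-grams (shingles) of the text.
    static Set<String> shingles(String text, int n) {
        Set<String> result = new HashSet<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // |A ∩ B| / |A ∪ B|; two empty sets are treated as identical.
    static double jaccardSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Jaccard distance is one minus the similarity.
    static double jaccardDistance(Set<String> a, Set<String> b) {
        return 1.0 - jaccardSimilarity(a, b);
    }

    public static void main(String[] args) {
        Set<String> d1 = shingles("the quick brown fox", 2);
        Set<String> d2 = shingles("the quick brown dog", 2);
        // The two texts share 2 of 4 distinct bigrams, so this prints 0.5.
        System.out.println(jaccardDistance(d1, d2));
    }
}
```

Comparing every pair of documents this way is quadratic in the number of documents, which is the cost that MinHash signatures and locality-sensitive hashing are meant to avoid.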

http://wiki.apache.org/hadoop/GettingStartedWithHadoop

GettingStartedWithHadoop - Hadoop Wiki

Note: the instructions below are primarily for the 0.2x series of Hadoop; for the 1.0.x series, the articles listed on the wiki page will probably be easier to follow. Hadoop can be downloaded from one of the Apache download mirrors.
Introduction

Hadoop Tutorial

http://developer.yahoo.com/hadoop/tutorial/module2.html#programmatically

How to read all files in a directory in HDFS using Hadoop filesystem API - Hadoop and Hive

https://sites.google.com/site/hadoopandhive/home/how-to-read-all-files-in-a-directory-in-hdfs-using-hadoop-filesystem-api Code to read all files in a directory in the HDFS file system.
http://www.hackido.com/2010/05/install-hadoop-and-hive-on-ubuntu-lucid.html

Install Hadoop and Hive on Ubuntu Lucid Lynx

If you've got a need to do some map reduce work and decide to go with Hadoop and Hive, here's a brief tutorial on how to get it installed.

sort reducer input values in hadoop

http://riccomini.name/posts/hadoop/2009-11-13-sort-reducer-input-value-hadoop/

Using Hadoop’s DistributedCache - Nube Technologies

While working with Map Reduce applications, there are times when we need to share files globally with all nodes on the cluster. This can be a shared library to be accessed by each task, a global lookup file holding key value pairs, jars or archives containing executable code. http://nubetech.co/using-hadoops-distributedcache
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

Map Reduce Secondary Sort Does It All | Mawazo

I recently came across a question on Stack Overflow about calculating web chat room statistics using Hadoop Map Reduce.
A6

CS246: Mining Massive Data Sets

Mining Massive Data Sets Winter 2011 Course information:

Hadoop input format for swallowing entire files.

Found the revised new-API classes for the Hadoop Definitive Guide examples here. (by cyavvari, Oct 11)

Is the cluster set up correctly?

How to Benchmark a Hadoop Cluster