Beyond Hadoop: Next-Generation Big Data Architectures — Cloud Computing News
How to Include Third-Party Libraries in Your Map-Reduce Job
In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. MinHash
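The estimation idea in the MinHash snippet above can be sketched in a few lines. This is a minimal illustration, not the code from any of the linked pages: seeded `blake2b` hashes stand in for random permutations, and the names `minhash_signature` and `estimate_similarity` as well as the default of 128 hash functions are assumptions made for this example. The fraction of positions where two signatures agree approximates how similar the underlying sets are.

```python
import hashlib


def _seeded_hash(seed: int, item: str) -> int:
    # Assumption: a salted 64-bit blake2b digest stands in for one random permutation.
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    # For each hash function, keep the minimum hash value over the (non-empty) set.
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    # The share of matching signature positions estimates the set similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {str(i) for i in range(0, 100)}
    b = {str(i) for i in range(50, 150)}
    # The exact intersection-over-union here is 50/150; the estimate should land near it.
    print(estimate_similarity(minhash_signature(a), minhash_signature(b)))
```

More hash functions give a tighter estimate at the cost of longer signatures; the sets must be non-empty because `min` is taken over their elements.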
wfc0398-liPS.pdf (application/pdf Object)
java.lang.Object com.aliasi.tokenizer.TokenNGramTokenizerFactory TokenNGramTokenizerFactory (LingPipe API)
Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as size of their intersection divided by size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance. Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing « LingPipe Blog
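A small sketch of the set comparison described in the snippet above, assuming whitespace tokenization and token bigrams as shingles (both choices, and the function names, are made up for this example and are not LingPipe's API). Note that intersection size divided by union size is usually called the Jaccard similarity; the Jaccard distance is one minus that value.

```python
def shingles(text, n=2):
    # Token n-grams of the lowercased text; whitespace tokenization is an assumption.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    # Size of the intersection divided by size of the union.
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


if __name__ == "__main__":
    doc1 = shingles("the quick brown fox jumps")
    doc2 = shingles("the quick brown dog jumps")
    print(jaccard_similarity(doc1, doc2), jaccard_distance(doc1, doc2))
```

Comparing every pair of documents this way is quadratic in the number of documents, which is the scaling problem the linked post addresses.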
GettingStartedWithHadoop - Hadoop Wiki Note: for the 1.0.x series of Hadoop the following articles will probably be easiest to follow: The below instructions are primarily for the 0.2x series of Hadoop. Hadoop can be downloaded from one of the Apache download mirrors.
Introduction Hadoop Tutorial
How to read all files in a directory in HDFS using Hadoop filesystem API - Hadoop and Hive The following is the code to read all files in a directory in HDFS file system 1.
ssh: connect to host localhost port 22: Connection refused
Install Hadoop and Hive on Ubuntu Lucid Lynx If you've got a need to do some map reduce work and decide to go with Hadoop and Hive, here's a brief tutorial on how to get it installed.
Overview (Hadoop 0.20.205.0 API)
sort reducer input values in hadoop
org.apache.mahout.clustering.minhash Class and Subpackage | www.massapi.com
Using Hadoop’s DistributedCache - Nube Technologies
Map Reduce Secondary Sort Does It All | Mawazo I came across a question in Stack Overflow recently related to calculating web chat room statistics using Hadoop Map Reduce.
Hadoop Resources | Cloudera Resources | Data Consolidation Resources | Cloudera
MapReduce Applications | Mendeley Group
The Hadoop Tutorial Series « Java. Internet. Algorithms. Ideas.
Graph partitioning in MapReduce with Cascading - Ware Dingen 29 January 2012
package forma; Hadoop input format for swallowing entire files.
Found New API Revised Classes of the Hadoop Definitive Guide Examples here by Oct 11
Is the cluster set up correctly? How to Benchmark a Hadoop Cluster
Top K is slightly more complicated (in comparison) to implement efficiently: you might want to look at other projects like Pig to see how they do it (to compare and look at ideas). Re: setNumReduceTasks(1)
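One common way to keep Top K cheap, sketched here in plain Python rather than Hadoop code, is to have each map task emit only its local top K and let a single reduce step merge those short lists. This is an illustrative assumption, not a description of how Pig or the linked thread does it; `local_top_k` and `merge_top_k` are hypothetical names.

```python
import heapq


def local_top_k(records, k):
    # Per-split step: keep only the k (key, count) pairs with the largest counts.
    return heapq.nlargest(k, records, key=lambda kv: kv[1])


def merge_top_k(partials, k):
    # Single-reducer step: merge the small per-split lists into the global top k.
    merged = (kv for part in partials for kv in part)
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])


if __name__ == "__main__":
    splits = [
        [("a", 5), ("b", 1), ("c", 9)],
        [("d", 7), ("e", 3)],
    ]
    partials = [local_top_k(split, 2) for split in splits]
    print(merge_top_k(partials, 2))
```

The single reducer then sees at most K records per split instead of the full input. This only works when each key's count is already final within one split; otherwise counts have to be aggregated per key first.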
datawrangling/trendingtopics - GitHub
CS 61A Lecture 34: Mapreduce I
s Hadoop Demo VM - Cloudera Support