Beyond Hadoop: Next-Generation Big Data Architectures - Cloud Computing News After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing “NoSQL movement.” A key element of this movement is Hadoop, the open-source clone of Google’s internal MapReduce system. Whether it’s interpreted as “No SQL” or “Not Only SQL,” the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop. The only problem with this story is that the people who really do have cutting-edge performance and scalability requirements today have already moved on from the Hadoop model. A few have moved back to SQL, but the much more significant trend is that, having come to realize the capabilities and limitations of MapReduce and Hadoop, developers are now building a whole raft of radically new post-Hadoop architectures that are, in most cases, orders of magnitude faster at scale than Hadoop.
How to Include Third-Party Libraries in Your Map-Reduce Job “My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem, this blog post is for you. Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in the /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories.
In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words. Jaccard similarity and minimum hash values: The Jaccard similarity coefficient of two sets A and B is defined to be the size of their intersection divided by the size of their union, $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. MinHash
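The estimation idea in the MinHash excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme as implemented at AltaVista or in any library: the function names, the use of `blake2b` as a stable base hash, and the random linear hash functions modulo a Mersenne prime (standing in for true min-wise independent permutations) are all assumptions made for the example.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def base_hash(token):
    """Stable 64-bit hash of a string token, reduced modulo PRIME."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME


def make_hash_functions(k, seed=0):
    """Return k random (a, b) pairs; each defines h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, funcs):
    """Signature = the minimum hash value of the set under each hash function."""
    hashed = [base_hash(x) for x in items]
    if not hashed:
        raise ValueError("cannot build a signature for an empty set")
    return [min((a * h + b) % PRIME for h in hashed) for a, b in funcs]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates J(A, B)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

With more hash functions the estimate tightens around the exact value; comparing `estimate_jaccard` against `jaccard` on small sets is an easy way to see the trade-off between signature length and accuracy.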
wfc0398-liPS.pdf (application/pdf Object)
java.lang.Object com.aliasi.tokenizer.TokenNGramTokenizerFactory All Implemented Interfaces: TokenizerFactory, Serializable public class TokenNGramTokenizerFactory extends Object implements TokenizerFactory, Serializable A TokenNGramTokenizerFactory wraps a base tokenizer to produce token n-gram tokens of a specified size. TokenNGramTokenizerFactory (LingPipe API)
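To make "token n-gram tokens" concrete, here is a small Python analogue of the idea. It is not the LingPipe API: the function name `token_ngrams`, the `min_n`/`max_n` parameters, the space separator, and the output ordering (shorter n-grams first) are all assumptions chosen for illustration.

```python
def token_ngrams(tokens, min_n, max_n, sep=" "):
    """Join runs of min_n..max_n consecutive base tokens into n-gram tokens."""
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the base token sequence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(sep.join(tokens[start:start + n]))
    return ngrams


# Whitespace splitting stands in for the wrapped base tokenizer here.
base_tokens = "the quick brown fox".split()
bigrams = token_ngrams(base_tokens, 2, 2)
```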
Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as one minus the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance. The Book Check out Chapter 3, Finding Similar Items, from: Rajaraman, Anand and Jeff Ullman. 2010 (Draft). Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing « LingPipe Blog
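As a rough sketch of the first step named in that post title, the following Python turns documents into sets of character shingles and compares them with exact Jaccard distance. The function names and the choice of character (rather than token) shingles are assumptions for the example; this is the brute-force pairwise comparison that MinHash and locality-sensitive hashing are meant to scale up, not a scaled implementation.

```python
def shingles(text, k):
    """Return the set of all length-k character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


doc1 = shingles("abcd", 2)   # {"ab", "bc", "cd"}
doc2 = shingles("bcde", 2)   # {"bc", "cd", "de"}
distance = jaccard_distance(doc1, doc2)  # 2 shared shingles out of 4 distinct
```

Documents whose distance falls below a chosen threshold would be treated as near-duplicates; the threshold and the shingle length are tuning choices that depend on the corpus.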
GettingStartedWithHadoop - Hadoop Wiki Note: for the 1.0.x series of Hadoop the following articles will probably be easiest to follow: The below instructions are primarily for the 0.2x series of Hadoop. Hadoop can be downloaded from one of the Apache download mirrors.
Introduction HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it. Goals for this Module: understand the basic design of HDFS and how it relates to basic distributed file system concepts; learn how to set up and use HDFS from the command line; learn how to use HDFS in your applications. Hadoop Tutorial
How to read all files in a directory in HDFS using Hadoop filesystem API - Hadoop and Hive The following is the code to read all files in a directory in HDFS file system 1. Open File cat.java and paste the following code package org.myorg;import java.io.*;import java.util.*;import java.net.
ssh: connect to host localhost port 22: Connection refused
Install Hadoop and Hive on Ubuntu Lucid Lynx If you've got a need to do some map reduce work and decide to go with Hadoop and Hive, here's a brief tutorial on how to get it installed. This is geared more towards local development work than a standalone server, so be careful to use best practices if you decide to deploy this live. This tutorial assumes you're running Ubuntu Lucid Lynx, but it could work for other Debian-based distros as well. Read on to get started! Step 1: Enable multiverse repo and get packages The first thing we need to do is make sure we've got multiverse repos installed.
www.usenix.org/publications/login/2010-02/openpdfs/leidner.pdf
Overview (Hadoop 0.20.205.0 API)
sort reducer input values in hadoop
org.apache.mahout.clustering.minhash Class and Subpackage | www.massapi.com
Using Hadoop’s DistributedCache - Nube Technologies
Map Reduce Secondary Sort Does It All | Mawazo I came across a question on Stack Overflow recently related to calculating web chat room statistics using Hadoop Map Reduce. The answer to the question was begging for a solution based on map reduce secondary sort. I will provide details, along with code snippets, to complement my answer to the question.
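The secondary sort pattern mentioned above can be simulated locally in plain Python to show what the reducer ends up seeing. This is only a sketch of the idea, not the post's Hadoop code: the record layout `(natural_key, sort_field, value)` and the function name are assumptions, and a real job would achieve the same effect with a composite key plus custom partitioner and grouping comparator.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records):
    """Yield (natural_key, values) with values ordered by the secondary field.

    records: iterable of (natural_key, sort_field, value) tuples.
    """
    # The shuffle phase sorts on the full composite key (natural key, sort field).
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    # Grouping looks only at the natural key, so each reducer call
    # receives its values already sorted by the secondary field.
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [(r[1], r[2]) for r in group]


# Hypothetical chat room events: (room, timestamp, message).
events = [
    ("room1", 3, "c"),
    ("room2", 1, "x"),
    ("room1", 1, "a"),
    ("room1", 2, "b"),
]
grouped = list(secondary_sort(events))
```

The point of the pattern is that the framework's sort does the ordering, so the reducer never has to buffer and sort a key's values in memory.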
Hadoop Resources | Cloudera Resources | Data Consolidation Resources | Cloudera
MapReduce Applications | Mendeley Group
The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries. Mahout currently has: user- and item-based recommenders; matrix factorization based recommenders; K-Means and Fuzzy K-Means clustering; Latent Dirichlet Allocation; singular value decomposition; logistic regression based classifier; Complementary Naive Bayes classifier; random forest decision tree based classifier; high-performance Java collections (previously Colt collections); a vibrant community. With scalable we mean:
The Hadoop Tutorial Series « Java. Internet. Algorithms. Ideas.
Graph partitioning in MapReduce with Cascading - Ware Dingen 29 January 2012 I have recently had the joy of doing MapReduce based graph partitioning. Here's a post about how I did that. I decided to use Cascading for writing my MR jobs, as it is a lot less verbose than raw Java based MR.
atbrox: from mrjob.job import MRJob; from mrjob.protocol import RawProtocol; import json; import sys; import logging
Hadoop input format for swallowing entire files.
How to Benchmark a Hadoop Cluster
datawrangling/trendingtopics - GitHub
CS 61A Lecture 34: Mapreduce I