
Hadoop


Beyond Hadoop: Next-Generation Big Data Architectures — Cloud Computing News. After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing “NoSQL movement.”


A key element of this movement is Hadoop, the open-source clone of Google’s internal MapReduce system. Whether it’s interpreted as “No SQL” or “Not Only SQL,” the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop. The only problem with this story is that the people who really do have cutting-edge performance and scalability requirements today have already moved on from the Hadoop model. A few have moved back to SQL, but the much more significant trend is that, having come to understand the capabilities and limitations of MapReduce and Hadoop, these teams are now developing a whole raft of radically new post-Hadoop architectures that are, in most cases, orders of magnitude faster at scale than Hadoop.

How to Include Third-Party Libraries in Your Map-Reduce Job. “My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem, this blog is for you.


Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories.
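The wrapper script only builds the client-side classpath; jars that ship with your job still have to reach the task JVMs. One common route is the -libjars option, which works when the driver runs through ToolRunner/GenericOptionsParser so that the listed jars are shipped to the cluster and added to the task classpath. The sketch below is a minimal, hypothetical driver showing that wiring (the class name, the identity mapper/reducer, and the paths are placeholders, not code from the original post):

    package example;  // hypothetical package name

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Run roughly like this (jar names are illustrative):
    //   hadoop jar myjob.jar example.LibJarsDriver -libjars mylib.jar /in /out
    // ToolRunner passes the arguments through GenericOptionsParser, which is what
    // makes -libjars (and -D, -files, -archives) work.
    public class LibJarsDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "libjars example");  // old-style constructor for the 0.20.x line
            job.setJarByClass(LibJarsDriver.class);
            job.setMapperClass(Mapper.class);      // identity mapper, just to keep the sketch runnable
            job.setReducerClass(Reducer.class);    // identity reducer
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new LibJarsDriver(), args));
        }
    }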

EC2

MinHash. In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.


The scheme was invented by Andrei Broder (1997),[1] and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1] Jaccard similarity and minimum hash values: the Jaccard similarity coefficient of two sets A and B is defined to be J(A, B) = |A ∩ B| / |A ∪ B|. It is a number between 0 and 1; it is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. A small code sketch of the coefficient and its MinHash estimate is given below.

Wfc0398-liPS.pdf (application/pdf Object)
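To make both definitions concrete, here is a small, self-contained Java sketch (not taken from any of the linked pages) that computes the exact Jaccard coefficient of two word sets and then estimates it with MinHash using k seeded hash functions; the number of hash functions and the seeding scheme are illustrative choices.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class MinHashDemo {

        // Exact Jaccard: |A ∩ B| / |A ∪ B|.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<String>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }

        // MinHash signature: for each of k hash functions, keep the minimum
        // hash value seen over the set's elements.
        static int[] signature(Set<String> s, int[] seeds) {
            int[] sig = new int[seeds.length];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (String item : s) {
                for (int i = 0; i < seeds.length; i++) {
                    int h = (item.hashCode() ^ seeds[i]) * 0x9E3779B9;  // cheap seeded mix, illustrative only
                    if (h < sig[i]) sig[i] = h;
                }
            }
            return sig;
        }

        // The fraction of positions where the two signatures agree estimates
        // the Jaccard coefficient of the underlying sets.
        static double estimate(int[] sigA, int[] sigB) {
            int same = 0;
            for (int i = 0; i < sigA.length; i++) if (sigA[i] == sigB[i]) same++;
            return (double) same / sigA.length;
        }

        public static void main(String[] args) {
            Set<String> a = new HashSet<String>(Arrays.asList("big", "data", "needs", "hadoop"));
            Set<String> b = new HashSet<String>(Arrays.asList("big", "data", "needs", "minhash"));
            int[] seeds = new int[256];
            Random rnd = new Random(42);
            for (int i = 0; i < seeds.length; i++) seeds[i] = rnd.nextInt();
            System.out.println("exact Jaccard    = " + jaccard(a, b));                       // 0.6
            System.out.println("MinHash estimate = " + estimate(signature(a, seeds), signature(b, seeds)));
        }
    }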

TokenNGramTokenizerFactory (LingPipe API). com.aliasi.tokenizer.TokenNGramTokenizerFactory, implementing TokenizerFactory and Serializable. A TokenNGramTokenizerFactory wraps a base tokenizer to produce token n-gram tokens of a specified size.

For example, suppose we have a regex tokenizer factory that generates tokens from contiguous non-whitespace characters, and wrap it to produce 2- and 3-token n-grams:

    TokenizerFactory tf = new RegExTokenizerFactory("\\S+");
    TokenizerFactory ntf = new TokenNGramTokenizerFactory(2, 3, tf);

The sequences of tokens produced for some inputs are illustrated below; the start and end positions of each n-gram are calculated from the positions of the base tokens provided by the base tokenizer.
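The javadoc's example output did not survive the excerpt, so here is a small stand-alone Java sketch (plain Java, not the LingPipe classes above) showing what token n-grams of sizes 2 to 3 look like over a whitespace tokenization; the class and method names are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    public class TokenNGramDemo {

        // Produce all n-grams of minSize..maxSize consecutive tokens, joining the
        // member tokens with a single space.
        static List<String> tokenNGrams(String text, int minSize, int maxSize) {
            String[] tokens = text.split("\\s+");          // base "tokenizer": non-whitespace runs
            List<String> ngrams = new ArrayList<String>();
            for (int size = minSize; size <= maxSize; size++) {
                for (int start = 0; start + size <= tokens.length; start++) {
                    StringBuilder sb = new StringBuilder(tokens[start]);
                    for (int i = 1; i < size; i++) sb.append(' ').append(tokens[start + i]);
                    ngrams.add(sb.toString());
                }
            }
            return ngrams;
        }

        public static void main(String[] args) {
            // "a b c d" -> [a b, b c, c d, a b c, b c d]
            System.out.println(tokenNGrams("a b c d", 2, 3));
        }
    }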

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing « LingPipe Blog. Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.

The Book: Check out Chapter 3, Finding Similar Items, from:

GettingStartedWithHadoop. Note: for the 1.0.x series of Hadoop the following articles will probably be easiest to follow: The instructions below are primarily for the 0.2x series of Hadoop.


Hadoop can be downloaded from one of the Apache download mirrors.

Hadoop Tutorial. Introduction: HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information.


Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failure and their high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it. Goals for this module: understand the basic design of HDFS and how it relates to basic distributed file system concepts; learn how to set up and use HDFS from the command line; learn how to use HDFS in your applications.

How to read all files in a directory in HDFS using Hadoop filesystem API - Hadoop and Hive. The following is the code to read all the files in a directory in the HDFS file system.


1. Open the file cat.java and paste the following code (the listing is truncated here; a complete sketch is given below):

    package org.myorg;

    import java.io.*;
    import java.util.*;
    import java.net.*;

    public class cat {
        public static void main(String[] args) throws Exception {
            try {

Ssh: connect to host localhost port 22: Connection refused.

Install Hadoop and Hive on Ubuntu Lucid Lynx.

www.usenix.org/publications/login/2010-02/openpdfs/leidner.pdf
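The cat.java listing above is cut off, so here is a self-contained sketch of the same recipe: list every file in an HDFS directory with the FileSystem API and stream each one to standard output. The class name, and the assumption that the directory is passed as the first argument, are illustrative rather than the original article's code.

    package org.myorg;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Usage (path is illustrative):  hadoop jar cat.jar org.myorg.HdfsDirCat /user/hadoop/input
    public class HdfsDirCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path dir = new Path(args[0]);
            for (FileStatus status : fs.listStatus(dir)) {     // every entry directly under the directory
                if (status.isDir()) {                          // skip sub-directories in this simple sketch
                    continue;
                }
                FSDataInputStream in = null;
                try {
                    in = fs.open(status.getPath());
                    IOUtils.copyBytes(in, System.out, 4096, false);  // stream file contents to stdout
                } finally {
                    IOUtils.closeStream(in);
                }
            }
        }
    }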

Overview (Hadoop 0.20.205.0 API)

Sort reducer input values in hadoop.

org.apache.mahout.clustering.minhash Class and Subpackage.

Using Hadoop’s DistributedCache - Nube Technologies.

Map Reduce Secondary Sort Does It All. I came across a question on Stack Overflow recently related to calculating web chat room statistics using Hadoop Map Reduce.


The answer to the question was begging for a solution based on map reduce secondary sort. I will provide details, along with a code snippet, to complement my answer to the question. The problem: the data consists of a time stamp, a chat room zone, and the number of users. A sketch of the secondary sort machinery is given below.

Data Consolidation Resources.
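Here is a sketch of the secondary sort machinery referred to above (an illustration under assumptions, not the post's own code): the map output key is a Text of the form zone<TAB>timestamp, a custom partitioner and grouping comparator look only at the zone, and the sort comparator orders by zone and then timestamp, so each reduce call sees one zone's records in time order. The class names and the fixed-width timestamp assumption are mine.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite key layout: "zone\ttimestamp". Helper to pull out the zone part.
    class ZoneKeys {
        static String zone(Text key) {
            String s = key.toString();
            int tab = s.indexOf('\t');
            return tab < 0 ? s : s.substring(0, tab);
        }
    }

    // 1. Partition on the zone only, so all records for a zone reach the same reducer.
    class ZonePartitioner extends Partitioner<Text, Text> {
        public int getPartition(Text key, Text value, int numPartitions) {
            return (ZoneKeys.zone(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // 2. Sort comparator: order by zone, then timestamp. The natural Text order of
    //    "zone\ttimestamp" already does this when timestamps are fixed-width
    //    (e.g. zero-padded epoch seconds or ISO-8601 strings).
    class ZoneTimestampSortComparator extends WritableComparator {
        protected ZoneTimestampSortComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return a.compareTo(b);
        }
    }

    // 3. Grouping comparator: treat all keys with the same zone as one reduce group,
    //    so the reducer sees that zone's values in timestamp order within a single call.
    class ZoneGroupingComparator extends WritableComparator {
        protected ZoneGroupingComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return ZoneKeys.zone((Text) a).compareTo(ZoneKeys.zone((Text) b));
        }
    }

    // Wiring in the driver (job is an org.apache.hadoop.mapreduce.Job):
    //   job.setPartitionerClass(ZonePartitioner.class);
    //   job.setSortComparatorClass(ZoneTimestampSortComparator.class);
    //   job.setGroupingComparatorClass(ZoneGroupingComparator.class);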

A6

MapReduce Applications.

Apache Mahout: Scalable machine learning and data mining.

The Hadoop Tutorial Series « Java. Internet. Algorithms. Ideas.

Graph partitioning in MapReduce with Cascading - Ware Dingen. 29 January 2012. I have recently had the joy of doing MapReduce-based graph partitioning.


Here’s a post about how I did that. I decided to use Cascading for writing my MR jobs, as it is a lot less verbose than raw Java-based MR. The graph algorithm consists of one step to prepare the input data and then an iterative part that runs until convergence. The program uses a Hadoop counter to check for convergence and will stop iterating once it gets there. This is only true when you visualize the graph. Now, what we aim to do is tag each edge in the list with a partition number. The algorithm that we will apply is this: turn the edge lists into an adjacency list representation, then tag each source node + adjacency list with a partition ID equal to the largest node ID in the record. 1. For the first step, we need to change the representation of the graph into adjacency lists: group by source node and, for every group, output the source node and a list of all target nodes. A plain MapReduce sketch of this first step is given below.

Atbrox.
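The post itself uses Cascading; as a stand-in, here is a plain MapReduce sketch of the first step described above: group edges by source node, emit the adjacency list, and tag the record with an initial partition ID equal to the largest node ID in it. The tab-separated input format and the class names are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input lines are assumed to be tab-separated edges: "sourceId<TAB>targetId".
    class EdgeToAdjacencyMapper extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
        private final LongWritable source = new LongWritable();
        private final LongWritable target = new LongWritable();

        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            source.set(Long.parseLong(parts[0]));
            target.set(Long.parseLong(parts[1]));
            context.write(source, target);               // group edges by their source node
        }
    }

    // Output lines: "sourceId<TAB>partitionId<TAB>target1,target2,...".
    // The initial partition ID is the largest node ID in the record (source or any target).
    class AdjacencyListReducer extends Reducer<LongWritable, LongWritable, LongWritable, Text> {
        protected void reduce(LongWritable source, Iterable<LongWritable> targets, Context context)
                throws IOException, InterruptedException {
            long partition = source.get();
            StringBuilder adjacency = new StringBuilder();
            for (LongWritable target : targets) {
                partition = Math.max(partition, target.get());
                if (adjacency.length() > 0) adjacency.append(',');
                adjacency.append(target.get());
            }
            context.write(source, new Text(partition + "\t" + adjacency));
        }
    }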

Hadoop input format for swallowing entire files.

    package forma;

    import java.io.IOException;
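The listing above is truncated, so here is a sketch of the technique the title refers to: an input format that disables splitting and hands each entire file to the mapper as a single record, along the lines of the well-known WholeFileInputFormat example. The class names are an illustrative reconstruction, not the post's actual code.

    package forma;   // matching the truncated listing above; the rest is an illustrative reconstruction

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Each input file becomes exactly one record: key = nothing, value = the file's bytes.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        protected boolean isSplitable(JobContext context, Path file) {
            return false;                                   // never split: one mapper swallows one file
        }

        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
            return new WholeFileRecordReader();
        }

        static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            public boolean nextKeyValue() throws IOException {
                if (processed) return false;
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);  // read the whole file into memory
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            public NullWritable getCurrentKey() { return NullWritable.get(); }
            public BytesWritable getCurrentValue() { return value; }
            public float getProgress() { return processed ? 1.0f : 0.0f; }
            public void close() { }
        }
    }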


How to Benchmark a Hadoop Cluster. Is the cluster set up correctly? The best way to answer this question is empirically: run some jobs and confirm that you get the expected results. Benchmarks make good tests, as you also get numbers that you can compare with other clusters as a sanity check on whether your new cluster is performing roughly as expected. And you can tune a cluster using benchmark results to squeeze the best performance out of it. This is often done with monitoring systems in place, so you can see how resources are being used across the cluster.

setNumReduceTasks(1). Top K is slightly more complicated (in comparison) to implement efficiently: you might want to look at other projects like Pig to see how they do it (to compare and look at ideas). Just to get an understanding: your mappers generate <key, value> pairs, and you want to pick the top K based on value on the reducer side? Or can you have multiple keys coming in from various mappers that you need to aggregate at the reducer?

If the former (that is, each key is unique), then a combiner that emits the top K per mapper, followed by a single reducer which sorts and picks from the M * C * K tuples, should do the trick (M == number of mappers, C == average number of combiner invocations per mapper, K == number of output tuples required). A sketch of such a top-K combiner/reducer is given below.

datawrangling/trendingtopics - GitHub.

CS 61A Lecture 34: MapReduce I.
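To illustrate the combiner-plus-single-reducer recipe described above (an illustration, not the forum answer's code): each task keeps a bounded, value-ordered buffer of the best K pairs it has seen and emits the buffer in cleanup(). The same class can serve as the combiner and, with setNumReduceTasks(1), as the lone reducer; K, the Text/LongWritable types, and the assumption of non-negative count values are mine.

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Keeps the K records with the largest values. Used both as the combiner (top K per
    // mapper) and as the single final reducer (top K overall). If a key shows up more
    // than once in a task, its largest value is used. Values are assumed non-negative
    // (e.g. counts), so zero-padded decimal strings sort like numbers; the key is
    // appended so equal values do not overwrite each other in the TreeMap.
    public class TopKReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        private static final int K = 10;                       // illustrative choice
        private final TreeMap<String, Long> topK = new TreeMap<String, Long>();

        protected void reduce(Text key, Iterable<LongWritable> values, Context context) {
            long best = Long.MIN_VALUE;
            for (LongWritable v : values) best = Math.max(best, v.get());
            // Sort key: zero-padded value then the record key, so TreeMap order == value order.
            topK.put(String.format("%020d|%s", best, key.toString()), best);
            if (topK.size() > K) topK.remove(topK.firstKey());  // drop the current smallest
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit from largest to smallest value.
            for (Map.Entry<String, Long> e : topK.descendingMap().entrySet()) {
                String recordKey = e.getKey().substring(e.getKey().indexOf('|') + 1);
                context.write(new Text(recordKey), new LongWritable(e.getValue()));
            }
        }
    }

    // Driver wiring (illustrative):
    //   job.setCombinerClass(TopKReducer.class);
    //   job.setReducerClass(TopKReducer.class);
    //   job.setNumReduceTasks(1);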