MapReduce Patterns, Algorithms, and Use Cases « Highly Scalable

In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop’s MapReduce model with Mappers, Reduces, Combiners, Partitioners, and sorting. This framework is depicted in the figure below. MapReduce Framework Counting and Summing Problem Statement: There is a number of documents where each document is a set of terms. Solution: Let start with something really simple. The obvious disadvantage of this approach is a high amount of dummy counters emitted by the Mapper. In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage Combiners: Applications: Log Analysis, Data Querying Collating Problem Statement: There is a set of items and some function of one item. The solution is straightforward.

Why scaling horizontally is better: Fat is the new Fit Scaling, in terms of the internet, is a product or service’s ability to expand exponentially to meet need. There are two types of scalability: vertical scalability is the traditional and easiest was to expand – by upgrading the hardware you already own, and horizontal scalability is where you create a network of hardware which can expand (and contract) to suit demand at a given time. Let’s all face it: the internet is only going to get bigger. Recent events such as the uprisings of Egypt and Libya have proved without doubt that the third world is beginning to utilise the internet in the same way the west has been. Companies are using their servers to calculate evermore complex statistics from ever larger stores of raw data. Many websites have succumbed to the “digg” or “slashdot” effect, whereby their server's traffic or resources suddenly spike to the point where the server cannot handle the load. So what does it mean to scale? Why bother scaling horizontally? Straightforward?

CS 61A Home Page Course Resources Contest Results You have selected the winners of the Recursion Exposition! Here they are: The results from the Pig contest are in! Our top finishers out of 34 entries: Other Useful Information Course Schedule About Viewing Documents Course documents available through these Web pages are either plain text files, Postscript files, or PDF (Portable Document Format) files.

Big Data analytics with Hive and iReport Each J.J. Abrams’ TV series Person of Interest episode starts with the following narration from Mr. Finch one of the leading characters: “You are being watched. The government has a secret system–a machine that spies on you every hour of every day. In JCG article “Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed” JCG partner Rahul Patodi explained how to setup Hadoop. In this article we will set up a Hive Server, create a table, load it with data from a text file and then create a Jasper Resport using iReport. Note: I used Hadoop version 0.20.205, Hive version 0.7.1 and iReport version 4.5 running OpenSuSE 12.1 Linux with MySQL 5.5 installed. Assuming you have already installed Hadoop download and install Hive following the Hive Getting Started wiki instructions. Making a multiuser Hive metastore The default Hive install uses a derby embedded database as its metastore. Now let’s populate Hadoop Hive with some data Copy these files from your local filesystem to HDFS: 1.

self improvement - What is the single most effective thing you did to improve your programming skills MapReduce & Hadoop API revised | Datasalt Nowadays, Hadoop has become the key technology behind what has come to be known as “Big Data”. It has certainly worked hard to earn this position. It is mature technology that has been used successfully in countless projects. But now, with experience behind us, it is time to take stock of the foundations upon which it is based, particularly its interface. This article discusses some of the weaknesses of both MapReduce and Hadoop, which we, at Datasalt, shall attempt to resolve with an open-source project that we will soon be releasing. MapReduce MapReduce is the distributed computing paradigm implemented by Hadoop. Experience has shown us that the setup proposed by MapReduce for data processing creates difficulties for a series of issues that are quite common to any Big Data project. Compound records Key/value files are sufficient for implementing the typical WordCount, for example, since only two types of data per file are needed: a string for the word and an integer for the counter.

Color Hex - ColorHexa.com Hypertable Routs HBase in Performance Test - HBase Overwhelmed by Garbage Collection | High Scalability This is a guest post by Doug Judd, original creator of Hypertable and the CEO of Hypertable, Inc. Hypertable delivers 2X better throughput in most tests -- HBase fails 41 and 167 billion record insert tests, overwhelmed by garbage collection -- Both systems deliver similar results for random read uniform test We recently conducted a test comparing the performance of Hypertable (@hypertable) version 0.9.5.5 to that of HBase (@HBase) version 0.90.4 (CDH3u2) running Zookeeper 3.3.4. Introduction Hypertable and HBase are both open source, scalable databases modeled after Google's proprietary Bigtable database. OS: CentOS 6.1 CPU: 2X AMD C32 Six Core Model 4170 HE 2.1Ghz RAM: 24GB 1333 MHz DDR3 disk: 4X 2TB SATA Western Digital RE4-GP WD2002FYPS The HDFS NameNode and Hypertable and HBase master was run on test01. Random Write In this test we wrote 5TB of data into both Hypertable and HBase in four different runs, using value sizes 10000, 1000, 100, and 10. Random Read Zipfian Uniform Conclusion

Hidden features of Python Scaling A Web App 1,000x in 3 Days | William Hertling's Thoughtstream When I'm not writing science fiction novels, I work on web apps for a largish company*. This week was pretty darn exciting: we learned on Monday afternoon that we needed to scale up to a peak volume of 10,000 simultaneous visitors by Thursday morning at 5:30am. This post will be about what we did, what we learned, and what worked. A little background: our stack is nginx, Ruby on Rails, and mysql. Our site was in beta, and we've been getting a few hundred visitors a day. What we had going for us: We were already running on Amazon Web Services (AWS), using EC2 behind a load balancer.We already had auto-scale rules in place to detect when when CPU load exceeded a certain percentage on our app servers, and to double the number of EC2 instances we had.We had reliable process for deploying (although this became a constraint later).A month before we had spent a week optimizing performance of our app, and we had already cut page load time by 30%, largely by reducing database calls.

Message Queue Evaluation Notes From Second Life Wiki Second Life Wiki > Message Queue Evaluation Notes One of the infrastructure tools that we've identified for the future internal architecture of Second Life is messaging. Message queuing systems allow systems that send messages to not have to worry about how they will be delivered, and allow consumers of messages to gather whichever ones interest them, at their own pace. Ideally we'd have a completely scaleable system that clients could treat as singular black box. Our use cases mostly involve very large numbers of queues; the smallest number we're even considering is double the number of concurrent users. In any case, given that we expect that we'd have to develop our own queue scaling solution that involves partitioning, which may or may not be a task we are interested in taking on, the strongest candidates are RabbitMQ and Apache QPID. Criteria Questions The major use cases that we believe could be implemented with message queues are these: Group Chat Use Case RabbitMQ