MapReduce Patterns
In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.

[Figure: MapReduce Framework]

Counting and Summing

Problem Statement: There are a number of documents, where each document is a set of terms.

Solution: Let's start with something really simple. The obvious disadvantage of this approach is the large number of dummy counters emitted by the Mapper. To accumulate counters not just for one document but for all documents processed by one Mapper node, it is possible to leverage Combiners.

Applications: Log Analysis, Data Querying

Collating

Problem Statement: There is a set of items and some function of one item. The solution is straightforward.
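The counting idea in the excerpt can be sketched in plain Python rather than Hadoop itself. This is only an illustration under stated assumptions: the goal is taken to be the total number of occurrences of each term across all documents (the excerpt truncates the problem statement), and the names `naive_map`, `aggregating_map`, `combine`, `reduce_counts`, and `run_job` are hypothetical helpers, not part of any Hadoop API. Each inner list of documents stands in for the input handled by one Mapper node.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]
Document = List[str]


def naive_map(doc: Document) -> Iterator[Pair]:
    """Emit one dummy counter (term, 1) per term occurrence."""
    for term in doc:
        yield term, 1


def aggregating_map(doc: Document) -> Iterator[Pair]:
    """Pre-sum counters inside a single document before emitting."""
    yield from Counter(doc).items()


def combine(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Combiner: sum counters for everything produced on one Mapper node."""
    totals: Counter = Counter()
    for term, count in pairs:
        totals[term] += count
    yield from totals.items()


def reduce_counts(term: str, counts: Iterable[int]) -> Pair:
    """Reducer: sum all partial counts received for one term."""
    return term, sum(counts)


def run_job(splits: List[List[Document]],
            mapper: Callable[[Document], Iterator[Pair]]) -> Dict[str, int]:
    """Simulate map -> combine -> shuffle -> reduce in a single process."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for split in splits:  # one split per simulated Mapper node (assumption)
        mapped = (pair for doc in split for pair in mapper(doc))
        for term, count in combine(mapped):
            shuffled[term].append(count)  # group partial counts by key
    return dict(reduce_counts(t, c) for t, c in shuffled.items())


if __name__ == "__main__":
    splits = [[["a", "b", "a"], ["b", "c"]], [["a", "c", "c"]]]
    print(run_job(splits, naive_map))
    print(run_job(splits, aggregating_map))
```

Both mappers give the same totals. The difference is how many intermediate pairs leave each simulated node, which is the cost the Combiner is meant to reduce.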

http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Related:  BigData

The NoSQL movement In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology. This description immediately felt right; I’ve never been comfortable talking about NoSQL, which when taken literally, extends from the minimalist Berkeley DB (commercialized as Sleepycat, now owned by Oracle) to the big iron HBase, with detours into software as fundamentally different as Neo4J (a graph database) and FluidDB (which defies description). But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don’t see picketers outside Oracle’s headquarters. Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement.

R (programming language) R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software[2][3] and data analysis.[3] Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.[4][5][6][7] R is highly extensible through the use of user-submitted packages for specific functions or specific areas of study. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also eased by its lexical scoping rules.[19] Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols.

Blueprint for a Big Data Solution In today’s world, data is money. Companies are scrambling to collect as much data as possible, in an attempt to find hidden patterns that can be acted upon to drive revenue. However, if those companies aren’t using that data, and they’re not analyzing it to find those hidden gems, the data is worthless. One of the most challenging tasks when getting started with Hadoop and building a big data solution is figuring out how to take the tools you have and put them together. The Hadoop ecosystem encompasses about a dozen different open-source projects. How do we pick the right tools for the job?

NoSQL Data Modeling Techniques NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

Storage area network A SAN does not provide file abstraction, only block-level operations. However, file systems built on top of SANs do provide file-level access, and are known as SAN filesystems or shared disk file systems. Storage: Historically, data centers first created "islands" of SCSI disk arrays as direct-attached storage (DAS), each dedicated to an application, and visible as a number of "virtual hard drives" (i.e. LUNs).[1] Essentially, a SAN consolidates such storage islands together using a high-speed network.

Dempsy – a New Real-time Framework for Processing BigData Real time processing of BigData seems to be one of the hottest topics today. Nokia has just released a new open-source project - Dempsy. Dempsy is comparable to Storm, Esper, Streambase, HStreaming and Apache S4. The code is released under the Apache 2 license. Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important than "guaranteed delivery."

Why Do Hacker Prefer LINUX? Linux use is growing at an amazing rate. This operating system, which has no public relations department, advertising, or government lobby, is being used widely in homes and server rooms alike. It’s also free, and 100% open source, meaning anyone can look at each and every line of code in the Linux kernel. Linux is a true multiuser operating system, and has been since the very first version. It is powerful in its simplicity. Though there are robust graphical environments and tools, you can still do everything you could possibly need with just a keyboard and a shell prompt.

Modern Telecom Architectures Built with Hadoop This is the third in our series on modern data architectures across industry verticals. Others in the series are: Many of the world’s largest telecommunications companies use Hortonworks Data Platform (HDP) to manage their data. Through partnership with these companies, we have learned how our customers use HDP to improve customer satisfaction, make better infrastructure investments and develop new products. Hortonworks partner Teradata recently gave some use case examples in this video about how Verizon Wireless uses Teradata in combination with Hortonworks Data Platform to keep their customer churn below 1%. Rob Smith, Verizon Wireless’ Executive Director for IT, describes how his team uses their discovery platform to improve customer interactions, by:

Shellshock: The 'Bash Bug' That Could Be Worse Than Heartbleed Security researchers have discovered a vulnerability in the system software used in millions of computers, opening the possibility that attackers could execute arbitrary commands on web servers, other Linux-based machines and even Mac computers. Some researchers say Shellshock, which affects an application called Bash (which is why it's often simply called the "Bash Bug"), is potentially more serious and widespread than the Heartbleed bug discovered in April, though the two vulnerabilities are quite different in nature. Unlike Heartbleed, which forced users to change their passwords for various Internet services, Shellshock doesn't appear to have any easy solutions for average users right now.

Hadoop MapReducers in .NET - Making C# and VB First Class Citizens in Hadoo Hadoop was written in Java. So it makes sense that Java will always be Hadoop's best friend.
