
MapReduce


Small Data. To hear the pundits tell it, Big Data is going to revolutionize everything we do as marketers. By tapping into data on consumer behavior, ranging from where we shop to which apps we use and which comments we “like” on Facebook, Big Data can help marketers understand their audiences far better than before and send more targeted, more personal communications tailored to each individual user. There’s a reason IDC forecasts the market for Big Data to be $16.1 billion this year, and they may well be right. But the truth is that Big Data can also be expensive, cumbersome, and intimidating, and proving its value to get executive sign-off for Big Data analysis tools is often tricky. The good news is that in many cases, Big Data isn’t necessary. This is where Small Data comes in: it relies on the data you already have in place to help you make smarter decisions regarding your targeting, messaging, and campaign strategies.

Why MapReduce matters to SQL data warehousing. August 26, 2008. Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need major performance speed-ups.” The long answer goes something like this. The core ideas of MapReduce are:

- For large problems, parallel computing is much more cost-effective and/or feasible than the alternatives.
- If you shoehorn programs into a certain very simple framework – namely, one limited to map and reduce steps – then building a general execution engine that gives parallelism “for free” is straightforward.
- A lot more problems can be solved within that framework than one might at first expect.
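To make those two steps concrete, here is a minimal sketch of the classic word-count job against Hadoop's old-style org.apache.hadoop.mapred API (the one this page's JobConf references use). It illustrates the paradigm in general; it is not code from any of the posts quoted here.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map step: applied independently to each input record (here, one line
  // of text), which is what lets the engine parallelize it "for free".
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        out.collect(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reduce step: the framework groups values by key, so all counts for
  // one word arrive together and can simply be summed.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(word, new IntWritable(sum));
    }
  }
}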

In essence, you can do almost anything to a single record* – that’s a map step. (*Technically, MapReduce doesn’t allow for records.)

Three big myths about MapReduce. October 18, 2009. Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology

So let’s give it a try. When Dave DeWitt and Mike Stonebraker leveled their famous blast at MapReduce, many people thought they overstated their case. True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. Here are some examples of what I mean, drawn from my recent MapReduce webinar. If you do text indexing in MapReduce, your goal is to wind up with a text index.
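A hedged sketch of what such an indexing job can look like in the same old-style API (the class names, and the use of FileSplit to recover each input's filename, are my own illustration, not from the post): the map step emits (term, document) pairs, and the reduce step gathers each term's posting list.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {

  // Map: for every word occurrence, emit (term, name of the file it came from).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String doc = ((FileSplit) reporter.getInputSplit()).getPath().getName();
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        out.collect(new Text(tokens.nextToken()), new Text(doc));
      }
    }
  }

  // Reduce: collect the distinct documents for each term -- its posting list.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text term, Iterator<Text> docs,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Set<String> postings = new HashSet<String>();
      while (docs.hasNext()) {
        postings.add(docs.next().toString());
      }
      out.collect(term, new Text(postings.toString()));
    }
  }
}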

By no means do I think this is a weakness of the MapReduce programming paradigm. Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains.

HowManyMapsAndReduces. Picking the appropriate size for the tasks of your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures.

At one extreme is the 1 map/1 reduce case, where nothing is distributed. The other extreme is 1,000,000 maps and 1,000,000 reduces, where the framework runs out of resources for the overhead.

Number of Maps. The number of maps is usually driven by the number of DFS blocks in the input files. Actually controlling the number of maps is subtle; the number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num).

Number of Reduces. The right number of reduces seems to be 0.95 or 1.75 multiplied by (nodes * mapred.tasktracker.tasks.maximum).
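Putting those knobs together, here is a sketch of how the settings above might be applied through JobConf. The cluster figures (10 nodes with 4 task slots each) are hypothetical, chosen only to make the arithmetic visible.

import org.apache.hadoop.mapred.JobConf;

public class TaskSizing {
  public static void size(JobConf conf) {
    // Hypothetical cluster, for illustration only: 10 tasktracker nodes,
    // each with mapred.tasktracker.tasks.maximum = 4 slots.
    int nodes = 10;
    int slotsPerNode = 4;

    // A hint only: the framework still derives the real map count
    // mostly from the number of DFS blocks in the input files.
    conf.setNumMapTasks(200);

    // 0.95 lets all reduces launch at once when the maps finish;
    // 1.75 schedules a second wave for finer load balancing.
    conf.setNumReduceTasks((int) (0.95 * nodes * slotsPerNode)); // 38 reduces
  }
}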

Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize).
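As a rough worked example of that bound (the figures here are illustrative, not from the wiki page): with a 1 GB task heap and a 512 KB output buffer, the constraint 2 × 512 KB × numReduces << 1 GB caps numReduces just below 1,024 – consistent with the “roughly 1000” quoted above.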