
MapReduce


HaLoop - Project Hosting on Google Code. Why do we develop the HaLoop project? The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable, data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. However, these new platforms have no built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph processing, and model fitting. What is HaLoop? Simply speaking, HaLoop = Ha, Loop :-) HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.
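The iterative pattern HaLoop targets can be illustrated with a toy, in-memory sketch: a driver repeatedly runs a map step and a reduce step over the same loop-invariant data until a fixpoint is reached (here a miniature PageRank). This is only an illustration of the pattern, not HaLoop's actual API; all names are mine.

```python
from collections import defaultdict

def map_phase(ranks, links):
    # Map: each page distributes its current rank evenly over its out-links.
    contribs = []
    for page, rank in ranks.items():
        outs = links[page]
        if not outs:
            continue  # skip dangling pages in this toy version
        for dest in outs:
            contribs.append((dest, rank / len(outs)))
    return contribs

def reduce_phase(contribs, pages, damping=0.85):
    # Reduce: sum the contributions per page and apply the damping factor.
    sums = defaultdict(float)
    for dest, c in contribs:
        sums[dest] += c
    return {p: (1 - damping) / len(pages) + damping * sums[p] for p in pages}

def iterate_until_fixpoint(links, eps=1e-6, max_iters=100):
    # Driver loop: rerun map/reduce until ranks stop changing. In plain
    # MapReduce each pass would be a separate job that rereads the link
    # structure; HaLoop's loop-aware scheduling and caching avoid that.
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iters):
        new_ranks = reduce_phase(map_phase(ranks, links), pages)
        if max(abs(new_ranks[p] - ranks[p]) for p in pages) < eps:
            return new_ranks
        ranks = new_ranks
    return ranks
```

The point of the sketch is the driver loop at the bottom: the link structure is loop-invariant across iterations, which is exactly the kind of data HaLoop's caches are designed to keep local.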

Get started: HaLoop publications. Contact: Yingyi Bu (buyingyi@gmail.com).

Mesos's spark at master - GitHub.

Designs, Lessons and Advice from Building Large Distributed Systems.

The tech behind 236 eHarmony members getting hitched daily. News Analysis by Eric Lai, September 16, 2009, Computerworld - While eHarmony Inc.'s goal is to get its 20 million members married or into long-term relationships, the online matchmaker is a downright commitment-phobe in its use of technology. For the business intelligence infrastructure that powers its matchmaking algorithms and maximizes the effectiveness of its numerous TV ads, the firm relies on four database and data warehousing products: Oracle Database, the open-source MySQL database, another open-source data-crunching app, Hadoop, and data warehousing appliances from Netezza Inc.

For some IT managers, managing four such disparate products wouldn't be worth the trouble. "We always use multiple vendors for different things," Essas told the audience during his speech Wednesday at Computerworld's Business Intelligence Perspectives conference in Chicago. Essas says he likes the "leverage from playing multiple people against each other."

A practical scalable distributed B-tree.

Ivory: A Hadoop toolkit for Web-scale information retrieval.

Ivory is a Hadoop toolkit for web-scale information retrieval research that features a retrieval engine based on Markov Random Fields, appropriately named SMRF (Searching with Markov Random Fields). Ivory takes full advantage of the Hadoop distributed environment (the MapReduce programming model and the underlying distributed file system) for both indexing and retrieval. The current release of Ivory (release 0.4) works with Hadoop release 0.20.1. To temper expectations, please note that Ivory is not meant to serve as a full-featured search engine, but rather is aimed at information retrieval researchers who generally know their way around retrieval algorithms.

As a result, a lot of "niceties" are simply missing: for example, fancy interfaces or ingestion support for different file types. It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often. In short, Ivory is experimental!

HowManyMapsAndReduces. Picking the appropriate size for the tasks in your job can radically change the performance of Hadoop.

Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed. At the other extreme, with 1,000,000 maps and 1,000,000 reduces, the framework runs out of resources for the overhead.

Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files. Actually controlling the number of maps is subtle. The number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num).
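As a loose sketch, the sizing rules this wiki page gives (one map per DFS block, and the 0.95-or-1.75-per-slot reduce heuristic stated just below) can be computed like this. The function names and the 64 MB constant are mine; real values come from the cluster configuration.

```python
import math

# Illustrative constant; the default HDFS block size in this era was 64 MB.
DFS_BLOCK_SIZE = 64 * 1024 * 1024

def estimated_num_maps(input_file_sizes, block_size=DFS_BLOCK_SIZE):
    # Default rule of thumb: one map task per DFS block of each input file.
    return sum(max(1, math.ceil(size / block_size)) for size in input_file_sizes)

def recommended_num_reduces(nodes, tasks_max_per_node, factor=0.95):
    # The page's heuristic: 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum).
    return max(1, round(factor * nodes * tasks_max_per_node))
```

With the 0.95 factor, every reduce can launch as soon as the maps finish; with 1.75, faster nodes run a second wave of reduces, which improves load balancing at the cost of more overhead.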

Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize).

Hadoop & Big Data Blog » Blog Archive » The Second Hadoop UK. Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the River Thames.

We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side, and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are. There were around 55 attendees from a variety of organisations, both academic and professional. Tom White and I were there representing Cloudera, and there were attendees from Microsoft, HP, the Apache Software Foundation and the incredibly fashionable guys from Last.fm. The slides and talks have been made available by the organisers here – they’re well worth checking out if you want to get a cross-section of some current activity around Hadoop.

Practical MapReduce.

Mahout - Overview.

Hadoop User Group UK: HUGUK #2 - Wrap up.

Posted at 10:59 by Johan Oskarsson, filed under hadoop, huguk, skillsmatter, sun. The 14th of April marked the first Hadoop User Group meetup in 2009. The day was packed with interesting talks about Hadoop projects and real-world Hadoop use. If you missed any of them you can have a look at the videos and slides below. Not all of them are available yet, but they should be shortly.

Practical MapReduce - Tom White, Cloudera (video, slides)
Introducing Apache Mahout - Isabel Drost, ASF (video, slides)
Terrier - Iadh Ounis and Craig Macdonald, University of Glasgow (video)
Having fun with PageRank and MapReduce - Paolo Castagna, HP (video, slides)
Apache HBase - Michael Stack, Powerset (video, slides)
Hypercubes in HBase - Fredrik Möllerstrand, Last.fm (slides)
HADOOP-1722 and typed bytes - Klaas Bosteels, Last.fm (slides)
Scalable reasoning on RDF documents with Hadoop and HBase - Michele Catasta (slides)

Thanks to everyone who presented, and to Skills Matter for filming.

Hadoop & Big Data Blog » general.