background preloader

Parallel

Facebook Twitter

Akka Project. Rio - Welcome to the Rio Project. MapReduce. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. The core concepts are described in Dean and Ghemawat. The Map A map transform is provided to transform an input data row of key and value to an output key/value: map(key1,value) -> list<key2,value2> That is, for an input it returns a list containing zero or more (key,value) pairs: The output can be a different key from the input The output can have multiple entries with the same key The Reduce A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output. reduce(key2, list<value2>) -> list<value3> The MapReduce Engine The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data.

A distributed filesystem spreads multiple copies of the data across different machines. Limitations. Mriap2008. Shared nothing architecture. A shared nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage. Shared nothing is popular for web development because of its scalability. As Google has demonstrated, a pure SN system can scale almost infinitely simply by adding nodes in the form of inexpensive computers, since there is no single bottleneck to slow the system down.[4] Google calls this sharding.

A SN system typically partitions its data among many nodes on different databases (assigning different computers to deal with different users or queries), or may require every node to maintain its own copy of the application's data, using some kind of coordination protocol. This is often referred to as database sharding. Shared nothing architectures have become prevalent in the data warehousing space. What is shared? See also[edit]