⊿ Point. {R} Glossary. ◢ Keyword: M. ▰ Sources. 〓 Books [B] ◥ University. {q} PhD. ⏫ THEMES. ⏫ Big Data. [B] Big Data. ⚫ USA. ↂ EndNote. ☝️ BD Dummies. MapReduce. Overview[edit] MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.
A MapReduce framework (or system) is usually composed of three operations (or steps): Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. MapReduce allows for the distributed processing of the map and reduction operations. Another way to look at MapReduce is as a 5-step parallel and distributed computation: