
What is MapReduce

MapReduce is the heart of Hadoop®. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. For people new to this topic, it can be somewhat difficult to grasp, because it’s not typically something people have been exposed to previously. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
An example of MapReduce: Let’s look at a simple example. The data consists of city and temperature records:
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). For the records listed, the per-city maximums are:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
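The temperature example can be sketched in a few lines of Python. This is only an in-memory illustration of the two tasks, not Hadoop code: the function names `map_phase` and `reduce_phase` are made up for this sketch, and a real job would spread the records over many files and servers.

```python
from collections import defaultdict

# Sample (city, temperature) records copied from the example above.
records = [
    ("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
    ("Toronto", 4), ("Rome", 33), ("New York", 18),
]

def map_phase(records):
    """Emit one (city, temperature) key/value pair per input record."""
    for city, temperature in records:
        yield city, temperature

def reduce_phase(pairs):
    """Group the pairs by city and keep the maximum temperature per city."""
    grouped = defaultdict(list)
    for city, temperature in pairs:
        grouped[city].append(temperature)
    return {city: max(temps) for city, temps in grouped.items()}

max_by_city = reduce_phase(map_phase(records))
print(max_by_city)
# {'Toronto': 20, 'Whitby': 25, 'New York': 22, 'Rome': 33}
```

Because taking a maximum is associative, the same `reduce_phase` could be applied first to each file and then again to the per-file results.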

3150-map-reduce-for-machine-learning-on-multicore

Map-Reduce - MongoDB Manual 2.6.4
Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. For map-reduce operations, MongoDB provides the mapReduce database command. In a map-reduce operation, MongoDB applies the map phase to each input document (i.e. the documents in the collection that match the query condition). All map-reduce functions in MongoDB are JavaScript and run within the mongod process.
Note: For most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface.
Map-Reduce JavaScript Functions: In MongoDB, map-reduce operations use custom JavaScript functions to map, or associate, values to a key. The use of custom JavaScript functions provides flexibility to map-reduce operations.
Map-Reduce Behavior: In MongoDB, the map-reduce operation can write results to a collection or return the results inline. MongoDB supports map-reduce operations on sharded collections.
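The query, map, and reduce steps described in the MongoDB snippet can be imitated in plain Python to show the data flow. This is a rough sketch, not MongoDB's API: real map and reduce functions are JavaScript passed to the mapReduce command, and the `orders` documents, their field names, and the `map_reduce` helper below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order documents; the field names are illustrative only.
orders = [
    {"cust_id": "A123", "amount": 500, "status": "A"},
    {"cust_id": "A123", "amount": 250, "status": "A"},
    {"cust_id": "B212", "amount": 200, "status": "A"},
    {"cust_id": "A123", "amount": 300, "status": "D"},
]

def map_reduce(documents, query, map_fn, reduce_fn):
    """Filter documents, map each one to (key, value) pairs, reduce per key."""
    emitted = defaultdict(list)
    for doc in documents:
        if query(doc):  # only documents matching the query are mapped
            for key, value in map_fn(doc):
                emitted[key].append(value)
    # Results are returned inline here, as a plain dict.
    return {key: reduce_fn(key, values) for key, values in emitted.items()}

totals = map_reduce(
    orders,
    query=lambda doc: doc["status"] == "A",
    map_fn=lambda doc: [(doc["cust_id"], doc["amount"])],
    reduce_fn=lambda key, values: sum(values),
)
print(totals)
# {'A123': 750, 'B212': 200}
```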

10 things you should know about NoSQL databases
The relational database model has prevailed for decades, but a new type of database -- known as NoSQL -- is gaining attention in the enterprise. Here's an overview of its pros and cons. For a quarter of a century, the relational database (RDBMS) has been the dominant model for database management. But, today, non-relational, "cloud," or "NoSQL" databases are gaining mindshare as an alternative model for database management.
Five advantages of NoSQL
1: Elastic scaling. For years, database administrators have relied on scale up -- buying bigger servers as database load increases -- rather than scale out -- distributing the database across multiple hosts as load increases. RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases is designed to expand transparently to take advantage of new nodes, and these databases are usually designed with low-cost commodity hardware in mind.
2: Big data 4: Economics 1: Maturity

MapReduce Tutorial
This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial. Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods. We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter and others. Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.
Payload: Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods.
Mapper: Mapper maps input key/value pairs to a set of intermediate key/value pairs.
How Many Maps? Reducer Shuffle

MapReduce-MPI Library

Oracle NoSQL Database Technical Overview
The Oracle NoSQL Database is a distributed key-value database. It is designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes. Data is stored as key-value pairs, which are written to particular storage node(s), based on the hashed value of the primary key. Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure and optimal load balancing of queries. The Oracle NoSQL Driver links with the customer application, providing access to the data via the appropriate storage node for the requested key.
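The idea of choosing a storage node from the hashed value of the primary key can be sketched as follows. This is a generic illustration and makes no claim about how Oracle NoSQL Database actually hashes or places keys: the node names, the SHA-256 choice, the modulo placement, and the `TinyKeyValueStore` class are all assumptions, and replication is left out.

```python
import hashlib

# Hypothetical storage node names, for illustration only.
STORAGE_NODES = ["node-0", "node-1", "node-2"]

def node_for_key(primary_key, nodes=STORAGE_NODES):
    """Pick a node from a stable hash of the primary key (assumed scheme)."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

class TinyKeyValueStore:
    """In-memory stand-in for a driver that routes each key to one node."""

    def __init__(self, nodes=STORAGE_NODES):
        self.nodes = list(nodes)
        self.data = {name: {} for name in self.nodes}

    def put(self, key, value):
        self.data[node_for_key(key, self.nodes)][key] = value

    def get(self, key):
        return self.data[node_for_key(key, self.nodes)].get(key)

store = TinyKeyValueStore()
store.put("user:42", {"name": "Ada"})
print(node_for_key("user:42"), store.get("user:42"))
```

A plain modulo placement like this reshuffles most keys when the number of nodes changes, which is why production systems tend to use more elaborate partitioning schemes.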

dean

MySQL vs. MongoDB: Looking At Relational and Non-Relational Databases | Neon Rain Interactive
When building a custom web application you need to consider the type of database that best suits the data. Here's a quick guide on the differences between MySQL (Relational) and MongoDB (Non-Relational / NoSQL). It was back in 2004 that Ruby on Rails first came out and popularized web application frameworks. What you might not know is that it also popularized ORM (Object-Relational Mapping) layers with its ActiveRecord object. An ORM layer basically provides an object-oriented interface to a relational database. That means that instead of writing a query to insert or update a record, you assign some properties to an object and call a save method. For example, if you have a "post" object that represents a blog post, you can access its comments through the property "post.comments". Thankfully, we never jumped on to the ORM bandwagon.
Data Representation: MySQL represents data in tables and rows. MongoDB represents data as collections of JSON documents.
Querying: MongoDB uses object querying.
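The difference in data representation can be made concrete with the blog post example. The sketch below uses plain Python lists and dicts as stand-ins for tables and JSON documents; the table names, field names, and the `comments_for` helper are invented for illustration and involve no MySQL or MongoDB driver.

```python
import json

# Relational style: two "tables" of rows, linked by a post_id column.
posts_table = [{"id": 1, "title": "Hello"}]
comments_table = [
    {"id": 10, "post_id": 1, "body": "Nice post"},
    {"id": 11, "post_id": 1, "body": "Thanks"},
]

def comments_for(post_id):
    """Look up a post's comments the way a join on post_id would."""
    return [row for row in comments_table if row["post_id"] == post_id]

# Document style: one JSON document with the comments embedded in it.
post_document = {
    "title": "Hello",
    "comments": [{"body": "Nice post"}, {"body": "Thanks"}],
}

print([row["body"] for row in comments_for(1)])
print(json.dumps(post_document, indent=2))
```

In the relational form the comments are found by matching keys across tables, while in the document form they travel with the post itself.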

nosql - Non-Relational Database Design

Apache Hadoop 2.5.1 - MapReduce Tutorial
This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial. Let us first take the Mapper and Reducer interfaces. We will then discuss other core interfaces including Job, Partitioner, InputFormat, OutputFormat, and others. Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.
Payload: Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods.
Mapper: Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. Output pairs do not need to be of the same types as input pairs.
Reducer Sort
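The Mapper and Reducer roles can be sketched with a small word count in Python. This is not the Hadoop Java API: `mapper`, `reducer`, and `run_job` are hypothetical names, the input key is a line index rather than a byte offset, and the sort-and-group step only stands in for what the framework does between the two phases. Note that the input pairs are (int, str) while the intermediate pairs are (str, int), so the types differ as the tutorial allows.

```python
from itertools import groupby
from operator import itemgetter

def mapper(offset, line):
    """Map one (offset, line) input pair to zero or more (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce all intermediate values for one key to a single total."""
    return word, sum(counts)

def run_job(lines):
    """Run map, then sort and group by key, then reduce each group."""
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(mapper(offset, line))
    intermediate.sort(key=itemgetter(0))  # bring equal keys together
    return [
        reducer(word, [count for _, count in group])
        for word, group in groupby(intermediate, key=itemgetter(0))
    ]

result = run_job(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(result)
# [('bye', 1), ('goodbye', 1), ('hadoop', 2), ('hello', 2), ('world', 2)]
```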

Explaining Non-Relational Databases To My Mom | Ignored by Dinosaurs
I was on the phone with Mom yesterday, and we got to talking about technology - a thing that actually happens fairly frequently. Since I'm an only kid, she’s genuinely interested in everything that I do and it’s been helpful to have someone who’s mostly non-technical to bounce explanations off of when I’m getting my head around a new piece of gear. The piece of gear that I was explaining the other day was something called MongoDB. Mongo’s parent company is called 10gen, and they landed on the startup scene about 5 years ago or so with their flagship product, MongoDB.
The relational model: The relational model of storing data has been around for more than 40 years. The classic example I gave to my mom was that of a common blog. The relational model typically comes into play when you visit a blog that has comments.
Issues with the relational model: For the purposes of this simplistic example, this hopefully isn’t that hard to get your head around. Very good.
The non-relational model

Nonrelational Databases in a Big Data Environment
Nonrelational databases do not rely on the table/key model endemic to RDBMSs (relational database management systems). In short, specialty data in the big data world requires specialty persistence and data manipulation techniques. Although these new styles of databases offer some answers to your big data challenges, they are not an express ticket to the finish line. One emerging, popular class of nonrelational database is called not only SQL (NoSQL). The originators first envisioned databases that did not require the relational model and SQL. As these products were introduced into the market, the definition softened a bit and now they are thought of as “not only SQL,” again bowing to the ubiquity of SQL. The other class consists of databases that do not support the relational model, but rely on SQL as a primary means of manipulating the data within.

Non-Relational-Database Technologies
For over thirty years, relational database technology has been the gold standard. Modern workloads and unprecedented data volumes, however, are driving businesses to look at alternatives to the traditional relational database. This “NoSQL movement” has given rise to a host of non-relational-database technologies, designed for large-capacity storage and scalability. Some businesses may find that the best solution is a combination of both relational and non-relational databases, whichever tool is best for the job. In this regard, “NoSQL” is probably better referred to as “Not Only SQL,” rather than “No SQL at all.” NoSQL technologies vary widely, but they can be evaluated based on three key features: scalability, data and query model, and persistence design.
Scalability: In this context, “scalability” refers to scaling writes by automatically partitioning data across multiple machines. When choosing a distributed database, look for: 1) support for multiple datacenters and
Data and Query Model

NoSQL
A relatively new concept in the world of database systems is the NoSQL DBMS. Just what is NoSQL? Well, I bet you could have guessed that it doesn’t use SQL, right? Well, not exactly, at least not any more. The movement (and its name) is gaining popularity, but there isn’t much rigor in defining exactly what a NoSQL database system is, or what it must be able to do. At a high level, NoSQL implies non-relational, distributed, flexible, and scalable. There are a bunch of NoSQL “database systems” springing up.
CouchDB - a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript.
