Hadoop in 5 questions. How does this infrastructure for massive data processing work? What are its main building blocks? What about the first applications? 1 - What is Hadoop? It is an open source framework designed to run processing jobs on massive volumes of data, on the order of several petabytes (that is, several thousand TB). Hadoop was designed by Doug Cutting in 2004. Yahoo! 2 - How does this intensive-processing framework work? In a Hadoop architecture, this list is split into several parts, each part being stored on a different server cluster. 3 - What are its different building blocks? Let us continue with our example. The social networks Facebook, Twitter and LinkedIn rely on Hadoop. Downstream, the distribution and management of the computations are handled by MapReduce. Map, which is applied to a list of elements. 4 - Beyond Yahoo! 5 - What are the possible applications of Hadoop?
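The excerpt above describes a list being split into parts, with a Map step applied to the elements and the results combined afterwards. A minimal single-process sketch of that idea in plain Python follows. It does not use Hadoop or its API; the function names (`split_into_parts`, `map_part`, `reduce_counts`, `word_count`) and the sample word list are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_parts(items, n_parts):
    """Split a list into n_parts contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts each take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def map_part(part):
    """Map step: count the words of one part, independently of the others."""
    return Counter(part)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(items, n_parts=3):
    """Split the input, map each part, then reduce the partial results."""
    parts = split_into_parts(items, n_parts)
    partials = [map_part(p) for p in parts]  # on a cluster, one task per part
    return reduce(reduce_counts, partials, Counter())


# Hypothetical sample input.
words = ["hadoop", "hdfs", "hadoop", "map", "reduce", "map", "hadoop"]
totals = word_count(words)
```

In a real cluster each part would live on a different machine and the map step would run where the data is stored; here everything runs in one process, so only the shape of the computation is illustrated.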
Running Hadoop On Ubuntu Linux (Single-Node Cluster) @ Michael G. Noll In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it. This tutorial has been tested with the following software versions: Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04); Hadoop 1.0.3, released May 2012; Sun Java 6. Disabling IPv6
Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison :: Software architect Kristof Kovacs (Yes it's a long title, since people kept asking me to write about this and that too :) I do when it has a point.) While SQL databases are insanely useful tools, their monopoly of the last decades is coming to an end. And it's about time: I can't even count the things that were forced into relational databases, but never really fitted them. (That being said, relational databases will always be the best for the stuff that has relations.) But the differences between NoSQL databases are much bigger than there ever were between one SQL database and another. In this light, here is a comparison of Open Source NoSQL databases Cassandra, MongoDB, CouchDB, Redis, Riak, RethinkDB, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, OrientDB, Aerospike, Neo4j and HBase: The most popular ones: Redis (V3.2) Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory). For example: To store real-time stock prices. Cassandra (2.0)
The Apache Cassandra Project Apache Mahout: Scalable machine learning and data mining Welcome to Apache™ Hadoop®! Installing Ubuntu inside Windows using VirtualBox This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. The screenshots in this tutorial use Ubuntu 12.04, but the same principles also apply to Ubuntu 12.10, 11.10, 10.04, and any future version of Ubuntu. Actually, you can install pretty much any Linux distribution this way. Introduction VirtualBox allows you to run an entire operating system inside another operating system. Please be aware that you should have a minimum of 512 MB of RAM. 1 GB of RAM or more is recommended. Comparison to Dual-Boot Many websites (including the one you're reading) have tutorials on setting up dual-boots between Windows and Ubuntu. Advantages of virtual installation The size of the installation doesn't have to be predetermined. Follow these instructions to get an Ubuntu disk image (.iso file). After you launch VirtualBox from the Windows Start menu, click on New to create a new virtual machine. You can call the machine whatever you want. Click Next. Click Next again.
Survey distributed databases - Toad for Cloud Wiki Overview This document, researched and authored by Quest's chief software architect Randy Guck, provides a summary of distributed databases. These are commercial products, open source projects, and research technologies that support massive data storage (petabyte+) using an architecture that distributes storage and processing across multiple servers. These can be considered “Internet age” databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases. Distributed Database Concepts This section describes concepts that constitute the nature of modern distributed databases. NoSQL Databases Meaning “no SQL”, this is a term that casually describes the new breed of databases that are appearing largely in response to the limitations of existing relational databases. Schema-less: “Tables” don’t have a pre-defined schema. Database Types by Entity Type Distributed Memory Caches
HBase - Apache HBase Home Tez -