
Hadoop


How to Program MapReduce Jobs in Hadoop with R. The NSA submits its Hadoop database to the Apache Foundation. The gentle summer breeze is still blowing over Hadoop.

The NSA submits its Hadoop database to the Apache Foundation

This open source Java framework for building distributed file systems and managing data at scale gained a heavyweight new ally this week: the NSA (National Security Agency). The American national security agency, one of whose key missions is to collect, analyze and monitor military, government, commercial and personal communications for the United States, has submitted its NoSQL database project Accumulo to the highly popular Apache Foundation. The project, now placed in the open source foundation's incubator, must find its community. Started in 2008, Accumulo, comprising some 200,000 lines of code (mostly Java), is the result of three years of development initiated by the NSA, the agency explains on the foundation's site.

Java development 2.0: Big data analysis with Hadoop MapReduce

When Google launched its image search feature in 2001, it had 250 million indexed images. Less than a decade later, the search giant has indexed over 10 billion images. Thirty-five hours of content are uploaded to YouTube every minute. Twitter is said to handle, on average, 55 million tweets per day. Earlier this year, its search feature was logging 600 million queries daily. That is what we mean when we talk about big data. Data on such a massive scale was once limited to large corporations, universities, and governments: entities capable of buying hugely expensive supercomputers and hiring the staff to keep them running. One of the enabling technologies of the big data revolution is MapReduce, a programming model and implementation developed by Google for processing massive-scale, distributed data sets. Apache's Hadoop framework is essentially a mechanism for analyzing huge datasets, which do not necessarily need to be housed in a datastore.
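To make the programming model concrete, here is a minimal sketch of the canonical word-count example written against Hadoop's org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word. The class names are illustrative and are not taken from the article above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: emit (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```

The same split-emit-aggregate shape carries over to most MapReduce jobs; only the parsing in the mapper and the aggregation in the reducer change.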

Why Europe’s Largest Ad Targeting Platform Uses Hadoop

Richard Hutton, CTO of nugg.ad, authored the following post about how and why his company uses Apache Hadoop. nugg.ad operates Europe's largest targeting platform. The company's core business is to derive targeting recommendations from clicks and surveys. We measure these, store them in log files and later make sense of them all. From 2007 until mid-2009 we used a classical data warehouse solution. As data volumes increased and performance suffered, we recognized that a new approach was needed.
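As a hypothetical illustration of this kind of log processing (the post above does not describe nugg.ad's actual log format), a mapper could parse tab-separated click records and emit one count per campaign, to be summed by a reducer like the one sketched earlier. The ClickLogMapper name, the field layout, and the event-type filter are all assumptions made for the sketch.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClickLogMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text campaign = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: timestamp \t userId \t campaignId \t eventType
    String[] fields = line.toString().split("\t");
    if (fields.length >= 4 && "click".equals(fields[3])) {
      campaign.set(fields[2]);
      context.write(campaign, ONE);   // one click for this campaign
    }
  }
}
```

Paired with a summing reducer, a job like this would produce per-campaign click totals across all the log files stored in HDFS.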

Running Hadoop On Ubuntu Linux (Single-Node Cluster) @ Michael G. Noll

In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it. This tutorial has been tested with the following software versions: Ubuntu Linux 10.04 LTS (deprecated: 8.10, 8.04, 7.10, 7.04), Hadoop 1.0.3 (released May 2012), and Sun Java 6.
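Once the single-node cluster is running, a quick way to confirm HDFS is reachable from Java is a small smoke test against Hadoop's FileSystem API. This is a sketch under the assumption that the NameNode URI is hdfs://localhost:54310 and that the Hadoop 1.x property name fs.default.name applies; adjust both to match your core-site.xml. The class name and test path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    // Assumed NameNode address; change to whatever your core-site.xml uses.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:54310");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/hdfs-smoke-test.txt");

    // Write a line to HDFS.
    FSDataOutputStream out = fs.create(file, true);
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close();

    // Read it back and print it.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), "UTF-8"));
    System.out.println(in.readLine());
    in.close();

    // Clean up the test file.
    fs.delete(file, false);
  }
}
```

Compile it with the Hadoop jars on the classpath and launch it with the hadoop command so the cluster configuration directory is picked up automatically.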


Running Hadoop On Ubuntu Linux (Multi-Node Cluster) @ Michael G. Noll

In this tutorial I will describe the required steps for setting up a distributed, multi-node Apache Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets. In a previous tutorial, I described how to set up a Hadoop single-node cluster on an Ubuntu box.

The main goal of this tutorial is to get a more sophisticated Hadoop installation up and running, namely building a multi-node cluster using two Ubuntu boxes. This tutorial has been tested with the following software versions:

Figure 1: Cluster of machines running Hadoop at Yahoo!

Let's get started!
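With the cluster up, jobs are submitted through a driver class. The sketch below wires the word-count mapper and reducer from the earlier sketch into a Job and submits it to the cluster; the class name is a placeholder and the input and output HDFS paths are expected on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Hadoop 1.x constructor

    // Wire in the mapper/reducer classes sketched earlier in this collection.
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations in HDFS, passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /input /output, where the two paths are examples to be replaced by your own HDFS directories.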