
Welcome to Apache Pig!


Hadoop in 5 questions

What is the operating principle of this infrastructure for massive data processing? What are its main building blocks? What about the first applications?

1 - What is Hadoop? It is an open-source framework designed to run processing over massive data volumes, on the order of several petabytes (i.e. several thousand TB). Hadoop was designed by Doug Cutting in 2004, and Yahoo! was among its first large-scale users.

2 - What is the operating principle of this intensive-processing framework? In a Hadoop architecture, a large data set (think of it as one long list of records) is cut into several parts, and each part is stored on a different group of servers.

3 - What are its building blocks? Continuing our example: downstream, the distribution and management of the computations is handled by MapReduce, built around a Map function that is applied to a list of elements.

4 - Beyond Yahoo!? The social networks Facebook, Twitter and LinkedIn rely on Hadoop.

5 - What are the possible applications of Hadoop?
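The split-then-combine principle described above can be sketched in a few lines of plain Python (a toy illustration, not actual Hadoop): a list is cut into partitions, each partition is processed independently, as if on a separate server group, and the partial results are merged at the end.

```python
def split_into_partitions(records, n_parts):
    """Divide the input list into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]

def process_partition(partition):
    """Work done locally on one 'node': count words in its chunk."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    """Combine the per-partition results into a global answer."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

lines = ["big data", "big cluster", "data data"]
partials = [process_partition(p) for p in split_into_partitions(lines, 2)]
print(merge(partials))  # {'big': 2, 'data': 3, 'cluster': 1}
```

In real Hadoop, the splitting, placement, and merging are handled by the framework; only the per-partition logic is written by the user.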

MapReduce Overview

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which data must be transmitted.

"Map" step: each worker node applies the map() function to the local data and writes the output to temporary storage. MapReduce allows for distributed processing of the map and reduction operations. Another way to look at MapReduce is as a 5-step parallel and distributed computation.

Logical view: Map(k1,v1) → list(k2,v2)

Distributed computing

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.[1] The components interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.[2] There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. Distributed computing also refers to the use of distributed systems to solve computational problems.
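The coordinate-by-passing-messages idea can be sketched with threads and queues standing in for networked nodes (a toy model; the component names here are illustrative, not from any real library). A worker component receives request messages, computes, and replies with messages rather than shared state.

```python
import threading
import queue

requests, replies = queue.Queue(), queue.Queue()

def worker():
    """A component that interacts with others only through messages."""
    while True:
        msg = requests.get()                 # receive a message
        if msg is None:                      # sentinel: shut down
            break
        replies.put(("squared", msg * msg))  # reply by message, not shared state

t = threading.Thread(target=worker)
t.start()
for n in (2, 3):
    requests.put(n)                          # send requests to the worker
requests.put(None)
t.join()
results = [replies.get() for _ in range(2)]
print(results)  # [('squared', 4), ('squared', 9)]
```

Replacing the in-process queues with sockets, an RPC connector, or a message broker turns this single-process sketch into a genuinely distributed program without changing its structure.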

Welcome to Apache™ Hadoop®!

A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs.
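A sketch of what such a high-level program looks like in Pig Latin, the language the description refers to (the input path and field layout are assumed for illustration):

```
-- Load a text file (path is hypothetical), split each line into words,
-- group identical words together, and count each group.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

Pig compiles scripts like this into MapReduce jobs, so the author expresses the analysis declaratively while the infrastructure handles distribution.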