Map-reduce-hadoop

Votre première installation Hadoop.

The 7 most common Hadoop and Spark projects

There's an old axiom that goes something like this: if you offer someone your full support and financial backing to do something different and innovative, they'll end up doing what everyone else is doing. So it goes with Hadoop, Spark, and Storm. Everyone thinks they're doing something special with these new big data technologies, but it doesn't take long to encounter the same patterns over and over. Specific implementations may differ somewhat, but based on my experience, here are the seven most common projects.

18 essential Hadoop tools for crunching big data.

Getting Started with Hadoop

In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed cluster containing multiple nodes to a single-node pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The following versions of Hadoop are currently supported. MongoDB: install and run the latest version of MongoDB.

Mongodb hadoop tutorial.

Introduction to Hadoop and MapReduce.

Traitements Big Data avec Apache Spark - 1ère partie : Introduction.

Spark SQL and DataFrames - Spark 1.4.0 Documentation

Spark SQL is a Spark module for structured data processing.

It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. For how to enable Hive support, please refer to the Hive Tables section. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.

All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.

Starting Point: SQLContext

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants.
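As a minimal sketch of that starting point, assuming the pyspark shell (where the SparkContext `sc` already exists) and the people.json sample file shipped with the Spark distribution:

```python
from pyspark.sql import SQLContext

# Create the Spark SQL entry point from the existing SparkContext.
sqlContext = SQLContext(sc)

# Build a DataFrame from a structured data file (sample data included
# with the Spark distribution).
df = sqlContext.read.json("examples/src/main/resources/people.json")

df.printSchema()                   # inspect the inferred schema
df.select("name").show()          # project a column
df.filter(df["age"] > 21).show()  # filter rows
```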

Tutoriel VirtualBox : installer une machine virtuelle Ubuntu

How to download and install VirtualBox on Windows, Mac, or Linux.

Tutoriel Virtualbox.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
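A rough sketch of that abstraction in Spark's Python API, close in spirit to the log-mining example from the paper; the file name and filter strings are made up, and a pyspark shell (with `sc` defined) is assumed:

```python
# Coarse-grained transformations build a lineage; persist() keeps the
# result in memory for reuse across actions.
errors = (sc.textFile("server.log")
            .filter(lambda line: "ERROR" in line)
            .persist())

# Both actions reuse the in-memory data instead of re-reading the file;
# if a partition is lost, Spark recomputes it from the lineage above.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())
```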

Présentation et mise en place de vagrant

Vagrant is a tool that simplifies and automates the management of virtual machines. What is virtualization? That's all well and good, but what is a virtual machine? The principle of a virtual machine is to recreate an independent environment, whether software or hardware, using the resources of a host environment.

Vagrant, enlarge your VM

Vagrant, in case you don't already know it, provides reproducible development environments that are easy to configure and can be shared among team members. In short, you can describe and configure virtual machines (VMs) from a single text file, the Vagrantfile, as in the sketch below.
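A minimal sketch of such a Vagrantfile, assuming the VirtualBox provider and an Ubuntu base box (both choices are illustrative, not from the article):

```ruby
# Minimal Vagrantfile: one Ubuntu VM under VirtualBox.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"   # base box to fetch and boot
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048                  # RAM for the VM, in MB
  end
end
```

Running `vagrant up` in the directory containing this file downloads the box if needed and boots the VM; `vagrant ssh` then logs into it.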

spark_tutorial_student

During this tutorial we will cover:

Part 1: Basic notebook usage and Python integration
Part 2: An introduction to using Apache Spark with the Python pySpark API running in the browser
Part 3: Using RDDs and chaining together transformations and actions (see the sketch below)
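In the spirit of Part 3, a small sketch of chaining lazy transformations with a final action that triggers the computation (pyspark shell assumed; the data is made up):

```python
rdd = sc.parallelize(range(1, 11))

result = (rdd.map(lambda x: x * x)          # transformation: square
             .filter(lambda x: x % 2 == 0)  # transformation: keep evens
             .collect())                    # action: triggers the work
print(result)  # [4, 16, 36, 64, 100]
```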

Initiation au MapReduce avec Apache Spark

In the previous post we used the Map operation, which transforms values by applying a transformation function. We will now discover the Reduce operation, which performs aggregations; this lets us do MapReduce just as we would with Hadoop.

The theory

With Spark, as with Hadoop, a Reduce operation aggregates values two at a time, taking as many steps as needed to process every element of the collection. This is what allows the framework to perform aggregations in parallel, possibly across several nodes. The framework picks two elements and passes them to a function that we define. It follows that the output type of our function must be identical to the type it receives as input: the values must be homogeneous.
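A minimal sketch of the constraint just described, assuming the pyspark shell; the numbers, and the input.txt path in the word-count variant, are made up:

```python
numbers = sc.parallelize([4, 8, 15, 16, 23, 42])

# reduce() combines values two at a time, possibly in parallel across
# partitions; the lambda takes two values and returns one of the same type.
total = numbers.reduce(lambda a, b: a + b)
print(total)  # 108

# The classic MapReduce word count: map each word to a (word, 1) pair,
# then reduce the counts that share the same key.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))
```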

Présentation "Manifeste de Chris Date sur modèle « Objet Relationnel » (pour données structurées/SQL)", Professeur Serge Miranda.

M2 MBDS - Hadoop / Big Data.

RDF Schema 1.1

RDF Schema provides a data-modelling vocabulary for RDF data.

RDF Schema is an extension of the basic RDF vocabulary.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. This document is an edited version of the 2004 RDF Schema Recommendation. It was published by the RDF Working Group as a Recommendation, has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation.

RDF 1.1 Concepts and Abstract Syntax

The Resource Description Framework (RDF) is a framework for representing information in the Web. This document defines an abstract syntax (a data model) which serves to link all RDF-based languages and specifications. The abstract syntax has two key data structures. RDF graphs are sets of subject-predicate-object triples, where the elements may be IRIs, blank nodes, or datatyped literals; they are used to express descriptions of resources. RDF datasets are used to organize collections of RDF graphs, and comprise a default graph and zero or more named graphs.
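A small sketch of the triple model using the Python rdflib library (the library choice and the example IRIs are assumptions, not from the specification):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF

g = Graph()
alice = URIRef("http://example.org/alice")  # subject: an IRI

# Each add() inserts one subject-predicate-object triple into the graph.
g.add((alice, FOAF.name, Literal("Alice")))                   # object: a literal
g.add((alice, FOAF.knows, URIRef("http://example.org/bob")))  # object: an IRI

print(g.serialize(format="turtle"))
```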

Bib.

Map Reduce - A really simple introduction « Kaushik Sathupadi

Ever since Google published its research paper on MapReduce, you have been hearing about it here and there. If you have until now considered map-reduce a mysterious buzzword and ignored it, know that it's not: the basic concept is really very simple, and in this tutorial I try to explain it in the simplest way that I can. Note that I have intentionally left out some deeper details to make it really friendly to a beginner.

Map / Reduce – A visual explanation

Map/Reduce is a term commonly thrown about these days; in essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel (see the sketch below). A common use case for Map/Reduce is in document databases, which is why I found myself thinking deeply about this.

Data-Intensive Text Processing with MapReduce.
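To make the "divide a big task into parallel pieces" idea concrete, here is a toy, single-process sketch of the two phases; the documents and function names are invented for illustration:

```python
from collections import Counter
from functools import reduce

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(doc):
    # One independent task per document -> partial word counts.
    return Counter(doc.split())

def reduce_phase(left, right):
    # Merge two partial results into one.
    return left + right

# The map tasks are independent, so a real framework could run them on
# different machines before reducing the partial counts together.
partials = map(map_phase, documents)
totals = reduce(reduce_phase, partials, Counter())
print(totals)  # Counter({'the': 3, 'quick': 2, 'dog': 2, ...})
```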