
Spark


Collecting checkout receipts: a look at the architecture. Following our first article on the business stakes of collecting and analysing data in the mass-retail sector, we will present a use case and the issues associated with it.

Collecting checkout receipts: a look at the architecture

We will see how to address them by relying on recent technologies that have already proven themselves at the web giants: Kafka, Spark and Cassandra. The context: a mass-retail player wants to ingest and process in real time data such as the checkout receipts it issues, the flows of goods between its suppliers, warehouses and stores, and the user journey on its e-commerce site. For example, it wants to obtain, in real time, the revenue performance of each product category compared with the same day of the previous week (see the graph in the original article), so that marketing actions can be triggered quickly.
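As a rough sketch of the kind of computation involved (not the architecture described in the article itself), per-category revenue over a sliding window could be expressed with PySpark Streaming roughly as follows; the Kafka topic, broker address and "category;amount" message format are assumptions made for illustration.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="receipts-revenue")
ssc = StreamingContext(sc, 10)                  # 10-second micro-batches

# Hypothetical topic "receipts"; each message is assumed to be "category;amount".
stream = KafkaUtils.createDirectStream(
    ssc, ["receipts"], {"metadata.broker.list": "broker1:9092"})

def parse(msg):
    category, amount = msg[1].split(";")        # msg is a (key, value) pair
    return category, float(amount)

# Sum revenue per category over the last 10 minutes, recomputed every 30 seconds.
revenue = stream.map(parse).reduceByKeyAndWindow(lambda a, b: a + b, None, 600, 30)
revenue.pprint()

ssc.start()
ssc.awaitTermination()

The comparison with the same weekday of the previous week would then be made against figures already stored in Cassandra, which this sketch leaves out.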

Streaming Architecture. Performance Tuning of an Apache Kafka/Spark Streaming System. GitHub - ZubairNabi/prosparkstreaming: Code used in "Pro Spark Streaming: The Zen of Real-time Analytics using Apache Spark" published by Apress Publishing. GitHub - holdenk/spark-testing-base: Base classes to use when writing tests with Spark. Running Spark Tests in Standalone Cluster - Eugene Zhulenev. Home · fluxcapacitor/pipeline Wiki. GitHub - databricks/sbt-spark-package: Sbt plugin for Spark packages. GitHub - databricks/spark-corenlp: CoreNLP wrapper for Spark. Kafka-spark-consumer. Receiver Based Reliable Low Level Kafka-Spark Consumer for Spark Streaming.

kafka-spark-consumer

Built-in back-pressure controller. ZooKeeper-based offset management. WAL-less recovery. Custom message interceptor. This utility helps to pull messages from Kafka using Spark Streaming, with better handling of Kafka offsets and failures. Salient features of Kafka Spark Consumer:

- This consumer uses ZooKeeper to store the consumed and processed offset for each Kafka partition, which helps it recover in case of failure.
- A Spark Streaming job using this consumer does not require the WAL to recover from driver or executor failures.
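The library itself is a Java/Scala receiver, so no PySpark sample of it is given here; for comparison, this is roughly what consuming the same topic with Spark Streaming's stock receiver-based consumer looks like, which also keeps offsets in ZooKeeper but, unlike this library, relies on the write-ahead log for reliable recovery. The ZooKeeper quorum, group id and topic name are assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-receiver-demo")
ssc = StreamingContext(sc, 5)                    # 5-second batches

# Built-in receiver-based consumer: offsets tracked in ZooKeeper.
lines = KafkaUtils.createStream(
    ssc,
    "zk1:2181",                                  # hypothetical ZooKeeper quorum
    "demo-consumer-group",                       # consumer group id
    {"receipts": 1})                             # topic -> number of receiver threads

lines.map(lambda kv: kv[1]).count().pprint()     # number of messages per batch

ssc.start()
ssc.awaitTermination()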

Mllib-grid-search. GitHub - FRosner/spawncamping-dds: Data-Driven Spark allows quick data exploration based on Apache Spark. Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle. Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk. Capital One’s original use case for Spark was to surface product recommendations for a set of 25 million users and 10 million products, one of the largest datasets available for this type of modeling.

Note: this post is not intended as a ground-zero introduction. The Pieces. Pro spark streaming real time analytics. GitHub - databricks/learning-spark: Example code from Learning Spark book. Spark Streaming demo. Databricks Spark Reference Applications. In this reference application, we show how you can use Apache Spark for training a language classifier - replacing a whole suite of tools you may be currently using.

Databricks Spark Reference Applications

This reference application was demoed at a meetup which is taped here (the link skips straight to demo time, but the talk before it is useful too). Here are five typical stages for creating a production-ready classifier. Often, each stage is done with a different set of tools and even by different engineering teams:

1. Scrape/collect a dataset.
2. Clean and explore the data, doing feature extraction.
3. Build a model on the data and iterate/improve it.
4. Improve the model using more and more data, perhaps upgrading your infrastructure to support building larger models (such as migrating over to Hadoop).
5. Apply the model in real time.

Spark can be used for all of the above and is simple to use for all of these purposes. (A toy sketch of the modelling and scoring stages follows below.) Spark Tutorial. What's Spark?
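As a toy illustration of the modelling and scoring stages only (the reference application is a streaming language classifier; here the corpus, labels and feature size are all invented), MLlib's hashing vectorizer and Naive Bayes can be wired together like this:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="toy-classifier")

# Hypothetical labelled corpus: (label, tokens); 0.0 = French, 1.0 = English.
corpus = sc.parallelize([
    (0.0, ["bonjour", "tout", "le", "monde"]),
    (1.0, ["hello", "world", "again"]),
])

htf = HashingTF(numFeatures=1000)                        # hash tokens into fixed-size vectors
training = corpus.map(lambda lt: LabeledPoint(lt[0], htf.transform(lt[1])))

model = NaiveBayes.train(training)                       # build the model
print(model.predict(htf.transform(["hello", "spark"])))  # score a new document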

Spark Tutorial

Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks. Using MongoDB with Apache Spark. This is a guest blog from Matt Kalan, a Senior Solution Architect at MongoDB. Introduction: The broad spectrum of data management technologies available today makes it difficult for users to discern hype from reality.

Using MongoDB with Apache Spark

While I know the immense value of MongoDB as a real-time, distributed operational database for applications, I started to experiment with Apache Spark because I wanted to understand the options available for analytics and batch operations. I started with a simple example of taking 1-minute time series intervals of stock prices with the opening (first) price, high (max), low (min), and closing (last) price of each time interval and turning them into 5-minute intervals (called OHLC bars). Introducing Window Functions in Spark SQL. In this blog post, we introduce the new window function feature that was added in Apache Spark 1.4.

Introducing Window Functions in Spark SQL

Window functions allow users of Spark SQL to calculate results such as the rank of a given row or a moving average over a range of input rows. They significantly improve the expressiveness of Spark’s SQL and DataFrame APIs. This blog will first introduce the concept of window functions and then discuss how to use them with Spark SQL and Spark’s DataFrame API. What are Window Functions? Before 1.4, there were two kinds of functions supported by Spark SQL that could be used to calculate a single return value.
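As a minimal illustration of both kinds of result mentioned above (the data and column names are invented), a per-category rank and a moving average can be written with the 1.4 DataFrame API like this:

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F
from pyspark.sql.window import Window

sc = SparkContext(appName="window-demo")
sqlContext = SQLContext(sc)

# Hypothetical sales data.
df = sqlContext.createDataFrame([
    ("fruit", "apple", 3.0), ("fruit", "pear", 5.0), ("fruit", "plum", 4.0),
    ("dairy", "milk", 2.0), ("dairy", "cheese", 7.0),
], ["category", "product", "revenue"])

# Rank products by revenue within each category.
by_category = Window.partitionBy("category").orderBy(F.desc("revenue"))
ranked = df.withColumn("rank", F.rank().over(by_category))

# Moving average over the current row and the two preceding rows.
trailing = Window.partitionBy("category").orderBy("revenue").rowsBetween(-2, 0)
ranked.withColumn("moving_avg", F.avg("revenue").over(trailing)).show()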

RDDs are the new bytecode of Apache Spark. With the Apache Spark 1.3 release, the DataFrame API for Spark SQL was introduced; for those of you who missed the big announcement, I'd recommend reading the article Introducing DataFrames in Spark for Large Scale Data Science on the Databricks blog.

RDDs are the new bytecode of Apache Spark

DataFrames are very popular among data scientists; personally I've mainly been using them with the great Python library pandas, but there are many examples in R (where they originated) and Julia. Of course, if you're using only Spark's core features, nothing seems to have changed with Spark 1.3: Spark's main abstraction remains the RDD (Resilient Distributed Dataset), its API is very stable now, and everyone has used it to handle any kind of data until now. But the introduction of DataFrames is actually a big deal, because when RDDs were the only option for loading data, you had to parse your (maybe) unstructured data with RDDs, transform the records into case classes or tuples, and then do the specific work you actually needed. Blog Technique Xebia - Cabinet de conseil IT. Presentation: To introduce SparkR, we will rely throughout the article on one dataset: Titanic.

Blog Technique Xebia - Cabinet de conseil IT

It contains information about all the Titanic passengers (name, address, ticket price, class, etc.), and in particular whether each passenger survived. This dataset is available on the Kaggle site (kaggle.com), a platform hosting data science competitions. Competitors are asked to predict whether or not individuals survived, based on the other information. Spark data frames from CSV files: handling headers & column types - Nodalpoint.
>>> taxi_df = taxiNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0].strip(), parse(p[2].strip(), p[6].strip()), float(p[9]), float(p[10]), int(p[11]), p[12].strip()) ).toDF(schema)
>>> taxi_df.head(10)
[Row(id=u'e6b3fa7bee24a30c25ce87e44e714457', rev=u'1-9313152f4894bb47678d8ce98e9ec733', dropoff_datetime=datetime.datetime(2013, 2, 9, 18, 16), dropoff_latitude=40.73524856567383, dropoff_longitude=-73.99406433105469, hack_license=u'88F8DD623E5090083988CD32C84973E3', medallion=u'6B96DDFB5A50B96E72F5808ABE778B17', passenger_count=1, pickup_datetime=datetime.datetime(2013, 2, 9, 17, 59), pickup_latitude=40.775123596191406, pickup_longitude=-73.96345520019531, rate_code=1, store_and_fwd_flag=u'', trip_distance=3.4600000381469727, trip_time_in_secs=1020, vendor_id=u'VTS'),

Spark data frames from CSV files: handling headers & column types - Nodalpoint

Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration - Nodalpoint. In a previous post, we glimpsed briefly at creating and manipulating Spark dataframes from CSV files. In the couple of months since, Spark has already gone from version 1.3.0 to 1.5, with more than 100 built-in functions introduced in Spark 1.5 alone; so we thought it was a good time to revisit the subject, this time also utilizing the external package spark-csv, provided by Databricks. We'll use the same CSV file with header as in the previous post, which you can download here. In order to include the spark-csv package, we must start pyspark with the following argument (a typical invocation is sketched below). If this is the first time we use it, Spark will download the package from Databricks' repository, and it will be subsequently available for inclusion in future sessions.
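The exact launch argument is not reproduced in this excerpt; a typical Spark 1.5 session with spark-csv would look roughly like the following, where the package coordinates and file path are assumptions (in the interactive pyspark shell, sc and sqlContext come pre-created; the sketch creates them explicitly so it can also run via spark-submit):

# Submitted with something like (coordinates are an assumption):
#   spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 csv_demo.py

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-demo")
sqlContext = SQLContext(sc)

# Read a headered CSV and let spark-csv infer the column types.
taxi = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")          # first line holds the column names
        .option("inferSchema", "true")     # scan the data to guess column types
        .load("nyctaxi.csv"))              # hypothetical path to the file used in the post

taxi.printSchema()
taxi.describe("trip_distance", "trip_time_in_secs").show()   # neat summary statistics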

So, after the numerous INFO messages, we get the welcome screen, and we proceed to import the necessary modules: Automatic schema extraction. Pandarize your Spark DataFrames - Base Lab. About DataFrames: In the last blog post I gave you an overview of our Data Science stack based on Python. This time let's focus on one important component: DataFrames. DataFrames are a great abstraction for working with structured and semi-structured data. They are basically a collection of rows, organized into named columns. Think of relational database tables: DataFrames are very similar and allow you to do similar operations on them:

- slice data: select a subset of rows or columns based on conditions (filters)
- sort data by one or more columns
- aggregate data and compute summary statistics
- join multiple DataFrames

What makes them much more powerful than SQL is the fact that this nice, SQL-like API is actually exposed in a full-fledged programming language.
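A small PySpark sketch of those four operations (the data and column names are invented):

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F

sc = SparkContext(appName="dataframe-ops")
sqlContext = SQLContext(sc)

people = sqlContext.createDataFrame(
    [("alice", "fr", 34), ("bob", "de", 41), ("carol", "fr", 29)],
    ["name", "country", "age"])
countries = sqlContext.createDataFrame(
    [("fr", "France"), ("de", "Germany")], ["country", "label"])

# slice: rows matching a condition, and a subset of the columns
in_france = people.filter(people.country == "fr").select("name", "age")

# sort: by one or more columns
by_age = people.orderBy(F.desc("age"))

# aggregate: summary statistics per group
stats = people.groupBy("country").agg(F.avg("age").alias("avg_age"),
                                      F.count("*").alias("n"))

# join: combine two DataFrames on a key
people.join(countries, on="country").show()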

DataFrames were popularized by R and then adopted by other languages and frameworks. Big Data University. Introducing DataFrames in Spark for Large Scale Data Science. Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs).

This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens. As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. For new users familiar with data frames in other programming languages, this API should make them feel at home. Time Series Stream Processing with Spark and Cassandra. A KMeans example for Spark MLlib on HDInsight.

Today we will take a look at Spark's built-in machine learning library, MLlib (see the Spark MLlib Guide). KMeans is a popular clustering method. Clustering methods are used when there is no class to be predicted; instead, instances are divided into groups or clusters. The clusters will hopefully represent some mechanism at play that draws the instances to a particular cluster. The instances assigned to a cluster should bear a strong resemblance to each other. A typical use case for KMeans is segmentation of data (a minimal PySpark sketch follows after this excerpt). Getting Started with Cassandra and Spark. Introduction: This tutorial is going to go through the steps required to install Cassandra and Spark on a Debian system and how to get them to play nicely via Scala. Spark and Cassandra exist for the sake of applications in Big Data; as such, they are intended for installation on a cluster of computers, possibly spread over multiple geographic locations.
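Returning to the KMeans excerpt above: a minimal PySpark MLlib sketch, with made-up two-dimensional points and an arbitrary choice of k, looks like this:

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-demo")

# Tiny, made-up dataset: two obvious groups of points.
points = sc.parallelize([
    array([0.0, 0.0]), array([0.1, 0.2]), array([0.2, 0.1]),
    array([9.0, 9.0]), array([9.1, 8.8]), array([8.9, 9.2]),
])

model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)                  # the learned centroids
print(model.predict(array([0.3, 0.1])))      # cluster assignment for a new point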

This tutorial, however, will deal with a single computer installation. The aim of this tutorial is to give you a starting point from which to configure your cluster for your specific application, and give you a few ways to make sure your software is running correctly. Using Cassandra as a Spark RDD with the DataStax connector. Introduction: In this article, we will see how Cassandra and Spark can be used together to perform operations on large volumes of data, all in a distributed fashion.
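A rough PySpark sketch of what that looks like through the DataStax spark-cassandra-connector's DataFrame interface (the keyspace, table, column names and connector coordinates are all assumptions, and the connector must be on the classpath):

# Submitted with something like (coordinates are an assumption):
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0 \
#                --conf spark.cassandra.connection.host=127.0.0.1 cassandra_demo.py

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cassandra-demo")
sqlContext = SQLContext(sc)

# Expose a Cassandra table as a Spark DataFrame.
receipts = (sqlContext.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="retail", table="receipts")   # hypothetical keyspace/table
            .load())

# From here it is a regular, distributed DataFrame: filter, aggregate, join, ...
receipts.groupBy("category").sum("amount").show()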

Type safety on Spark DataFrames - Part 1 — 51zero Ltd. Without good documentation, it is impossible to know: what are the required columns in the input DataFrame? What columns are added to the output DataFrame? What are the types of the input/output columns: are they String, Double, Int? If you have a non-trivial program which composes several such transformations, it becomes tricky to follow what is going on. Without proper unit testing, your program becomes brittle and breaks with simple changes. You start to feel as if you were using some kind of dynamic language. (One defensive pattern is sketched below.) Databricks: Apache Spark - The SDK for All Big Data Platforms. Python - Column filtering in PySpark. Learning Spark - O'Reilly Media. Written by a group of enthusiasts and developers, including Matei, the original creator of the framework itself, Learning Spark targets data scientists and engineers.
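Compile-time type safety of the kind the 51zero article pursues is a Scala concern, but even in PySpark the documentation problem raised above can be eased by making a transformation's contract explicit and checking it at runtime; the column names below are invented for illustration.

from pyspark.sql import functions as F

def add_total(df):
    """Requires columns: quantity, unit_price (numeric). Adds column: total."""
    missing = {"quantity", "unit_price"} - set(df.columns)
    if missing:
        raise ValueError("add_total: missing input columns %s" % sorted(missing))
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))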

As expressly written on the back cover, this book is neither a reference nor a cookbook. Its goal is to present a different, faster alternative to Hadoop's MapReduce paradigm and to the Apache elephant itself. The reader is given a quick overview of the capabilities of the framework, such as the built-in libraries, Spark SQL and the many different data sources it can interact with. While not all the main features are presented, those that are found within these almost three hundred pages come with plenty of well-explained examples. The examples are, on the other hand, one of the many perplexities raised by this text: each is presented in Python, Java and Scala. Another thumbs-down for the complete absence of anything related to Spark's internal architecture.

Databricks Spark Knowledge Base. GitHub - databricks/spark-knowledgebase: Spark Knowledge Base. DataStax Academy: Free Cassandra Tutorials and Training. Databricks: Apache Spark - The SDK for All Big Data Platforms. GitHub - Huawei-Spark/Spark-SQL-on-HBase: Native, optimized access to HBase Data through Spark SQL/Dataframe Interfaces. Spark Programming Guide - Spark 0.9.0 Documentation. At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures. A second abstraction in Spark is shared variables that can be used in parallel operations. Spark SQL and DataFrames - Spark 1.5.2 Documentation. SparkSQL, How to re-use a Hive Custom UDF (Java)? - Databricks Community Forum.
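To make the description above concrete, here is a tiny PySpark equivalent of those operations (the HDFS path and the broadcast lookup table are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

# An RDD from an existing collection in the driver program...
squares = sc.parallelize(range(1, 1001)).map(lambda x: x * x)   # lazy transformation
squares.persist()                                               # keep it in memory for reuse
print(squares.reduce(lambda a, b: a + b))                       # action: triggers computation

# ...or from a file in HDFS (or any other Hadoop-supported file system).
lines = sc.textFile("hdfs:///data/receipts.csv")                # hypothetical path
print(lines.count())

# The second abstraction, shared variables: e.g. a read-only broadcast variable.
categories = sc.broadcast({"A": "food", "B": "drinks"})
print(lines.map(lambda l: categories.value.get(l[:1], "other")).take(5))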