
Lambda Architecture


Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 2) | Confluent. This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization. This advice is drawn from our experience building and implementing Kafka at LinkedIn and rolling it out across all the data types and systems there.

It also comes from four years of working with tech companies in Silicon Valley to build Kafka-based stream data platforms in their organizations. This is meant to be a living document: as we learn new techniques, or as new tools become available, I’ll update it.

Getting Started

Much of the advice in this guide covers techniques that will scale to hundreds or thousands of well-formed data streams.
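As a concrete illustration of what one record in a well-formed data stream might look like, here is a minimal sketch in Python. The event name, field names, and schema-versioning convention are assumptions chosen for illustration, not something prescribed by the guide:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PageViewEvent:
    """One record in a hypothetical 'page_views' stream."""
    user_id: str
    page: str
    timestamp_ms: int
    schema_version: int = 1  # versioning lets consumers evolve with the schema

event = PageViewEvent(user_id="u42", page="/home", timestamp_ms=1700000000000)
payload = json.dumps(asdict(event))  # serialized form, ready to publish to a topic
print(payload)
```

A well-formed stream keeps every record in a shape like this — explicitly typed fields plus a schema version — so that downstream consumers can parse it without guessing.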

Recommendations

Limit the number of clusters. The fewest number of clusters may not be one; for example, you may want to keep activity local to a datacenter.

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 1) | Confluent. These days you hear a lot about “stream processing”, “event data”, and “real-time”, often in relation to technologies like Kafka, Storm, Samza, or Spark’s Streaming module.

Though there is a lot of excitement, not everyone knows how to fit these technologies into their technology stack or how to put them to use in practical applications. This guide discusses our experience with real-time data streams: how to build a home for real-time data within your company, and how to build applications that make use of that data. All of this is based on real experience: we spent the last five years building Apache Kafka, transitioning LinkedIn to a fully stream-based architecture, and helping a number of Silicon Valley tech companies do the same thing.

The first part of the guide gives a high-level overview of what we came to call a “stream data platform”: a central hub for real-time streams of data. It covers the what and why of this idea. But first, what is a stream data platform? Big Data Processing with Apache Spark – Part 1: Introduction. Big Data Processing with Apache Spark – Part 2: Spark SQL. In the previous article in this series on Apache Spark, we saw what the framework consists of and how it helps meet a company’s big data analysis needs.
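As a taste of what the Spark SQL component enables — running ad-hoc SQL queries over structured data after an ETL step — here is a miniature sketch of that pattern using Python's standard sqlite3 module. This illustrates only the load-table-query flow, not Spark's actual API; in Spark you would work with a SQLContext and DataFrames instead:

```python
import json
import sqlite3

# A few JSON records, standing in for a dataset Spark SQL might load.
raw = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28}',
]
rows = [json.loads(line) for line in raw]

# "ETL" step: load the structured records into a queryable table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [(r["name"], r["age"]) for r in rows])

# Ad-hoc SQL over the loaded data, analogous to running a query in Spark SQL.
result = conn.execute("SELECT name FROM people WHERE age > 30").fetchall()
print(result)  # [('alice',)]
```

The value of the pattern is the separation: ingestion normalizes heterogeneous inputs (JSON, Parquet, databases) into tables once, and analysts then query them freely with plain SQL.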

Spark SQL, a component of the Apache Spark framework, is used to process structured data by running SQL-style queries on Spark data. It lets us run ad-hoc queries, after an ETL step, on data stored in various formats such as JSON or Parquet, or on data stored in databases, for example. In this second article, we will look at the Spark SQL library and see how it can be used to run SQL queries on data stored in flat files, JSON datasets, or Hive tables. Spark 1.3 is the latest version of the framework, released last month. The components of Spark SQL: DataFrame, SQLContext, data sources, JDBC, the Spark SQL application. Building Interactive Data Applications at Scale Presentation. Announcing Pulsar: Real-time Analytics at Scale | eBay Tech Blog. We are happy to announce Pulsar – an open-source, real-time analytics platform and stream processing framework.

Pulsar can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid. eBay provides a platform that enables millions of buyers and sellers to conduct commerce transactions. To help optimize eBay end users’ experience, we perform analysis of user interactions and behaviors.
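The "multi-dimensional metrics aggregation over time windows" that Pulsar performs can be sketched in plain Python with tumbling windows. This is only an illustration of the windowing idea, not Pulsar's SQL-like event processing language; the window size and event fields are assumptions for the example:

```python
from collections import Counter, defaultdict

WINDOW_MS = 10_000  # 10-second tumbling windows (illustrative choice)

# (timestamp_ms, dimension) event tuples, e.g. page views by country.
events = [
    (1_000, "US"), (4_000, "DE"), (9_999, "US"),
    (12_000, "US"), (15_000, "FR"),
]

# Bucket each event into its window, then count per dimension value.
windows = defaultdict(Counter)
for ts, country in events:
    windows[ts // WINDOW_MS].update([country])

for w in sorted(windows):
    print(w, dict(windows[w]))
# 0 {'US': 2, 'DE': 1}
# 1 {'US': 1, 'FR': 1}
```

A real stream processor does the same bucketing incrementally as events arrive, emitting each window's aggregate when the window closes rather than holding all events in memory.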

Use cases include:

- Real-time reporting and dashboards
- Business activity monitoring
- Personalization
- Marketing and advertising
- Fraud and bot detection
- TopN computation

Apache Kafka. Facet.js | home. Druid | Interactive Analytics at Scale. Apache Spark™ - Lightning-Fast Cluster Computing. Lambda Architectures in Practice.
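The TopN computation listed among Pulsar's use cases above can be sketched in a few lines of Python; the click events and N=2 are made-up inputs for illustration, and a streaming system would maintain the counts incrementally rather than from a static list:

```python
import heapq
from collections import Counter

# Simulated click events; TopN asks for the N most active users.
clicks = ["u1", "u2", "u1", "u3", "u1", "u2"]
counts = Counter(clicks)

# heapq.nlargest extracts the top-N entries without sorting everything.
top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top2)  # [('u1', 3), ('u2', 2)]
```

At stream scale the same idea is usually applied per time window, and often with approximate counting structures when the key space (users, queries, items) is too large to hold exactly.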