background preloader

Stream processing

Facebook Twitter

Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse. Real time insights into LinkedIn's performance using Apache Samza. It's not easy to quickly gather all the data that goes into a LinkedIn page view, particularly for something like our home page.

Real time insights into LinkedIn's performance using Apache Samza

LinkedIn benefits from a very distributed service-oriented architecture for assembling pages as quickly possible while still being resilient to failures from any particular bit of content. Each bit that ends up on the page is provided by separate services, each of which often will call upon other, subsequent services in order to finish its work. This approach is great for building a reliable, scalable website, but does make it more challenging to get a holistic view of everything that goes into building those pages, since the effort was distributed across many machines operating independently. Enter Apache Samza (incubating), which has allowed us to build a near real-time view of how pages are being built across hundreds of different services and thousands of machines.

Assembling a page view in hundreds of easy steps Job #1: Repartition on TreeId. TypeSafe's Kevin Webber: Actor-based Concurrency for Reactive Systems. In a recent article on Medium, TypeSafe's Kevin Webber argues that reactive programming "isn’t just another trend but rather the paradigm for modern software developers to learn" since it helps them to build systems that are responsive, resilient, and scalable.

TypeSafe's Kevin Webber: Actor-based Concurrency for Reactive Systems

He also suggests that actor-based concurrency is the most convenient foundations for a reactive system. According to reactive manifesto, reactive applications are built on four guiding principles: responsiveness, resiliency, scalability, and message-driven architecture. In Kevin's view, those four guiding principles are closely related in that both resiliency and scalability are necessary qualities to ensure that a system remains responsive, i.e., "quick to react to users," under varying working conditions and loads. Spot-on Introduction to Akka and Actors. More metrics in Apache Camel 2.14.

Apache Camel 2.14 is being released later this month.

More metrics in Apache Camel 2.14

There is a slight holdup due some Apache infrastructure issue which is being worked on. This blog post is to talk about one of the new functions we have added to this release. Thanks to Lauri Kimmel who donated a camel-metrics component, we integrated with the excellent codehale metrics library. So I took this component one step further and integrated it with the Camel routes so we have additional metrics about the route performances using codehale metrics. End to End Reactive Programming at Netflix. Twitter's Hadoop project gets Apache's blessing. Storm, a framework for real-time data processing in Hadoop, got a major promotion.

Twitter's Hadoop project gets Apache's blessing

After joining the Apache Incubator in September 2013, it's now a full-blown, top-level Apache Foundation project. Storm's main application is the processing of streaming real-time data (or "fast data," per John Hugg's description). Its processing power is designed to scale across multiple nodes, with up to 1 million, 100-byte messages per second per node as an advertised benchmark. As with most other work done for Hadoop, Java is the most broadly supported language for working in Storm, though other languages are in the mix. What's the significance of Storm becoming a "top-level project"? Storm first found a foothold at Twitter following the company's acquisition of original developer BackType in July 2011; Twitter later released Storm as open source.

Apache Camel 2.14: Java 8, Spring 4, REST DSL and Metrics. The Apache Camel team recently released version 2.14, their 66th release.

Apache Camel 2.14: Java 8, Spring 4, REST DSL and Metrics

Camel is an open-source integration framework that provides components based on the popular enterprise integration patterns. It allows an application to define route and mediation rules in many domain-specific languages (DSLs), for example with Java, XML, Groovy and Scala. New features include a REST DSL and Swagger integration for easily documenting an API, as well as new support for Java 8 (Java 6 is no longer supported) and Spring 4 (users of Spring 3.x or earlier will need to use camel-test-spring3 for testing). A Java 8 specific DSL that allows for lambda expressions was deferred until the next release. In a blog post entitled Easy REST endpoints with Apache Camel 2.14, Christian Posta, Principal Middleware Specialist/Architect at Red Hat wrote: I recently integrated Metrics into an Apache Camel / CXF / Spring Boot application and found it took just one line in my class.

Apache Spark Plus Many Other Frameworks: How Spark Fits into the Big Data Landscape. Unified Big Data Processing with Apache Spark. GearPump Real-time Streaming Engine Using Akka. Today we are excited to share a technical article by our friends at GearPump ( a high performance, lightweight, real-time streaming engine built on top of Akka.

GearPump Real-time Streaming Engine Using Akka

Sean Zhong, Kam Kasravi, Huafeng Wang, Manu Zhang and Weihua Jiang have written a quick summary below, linking to the full paper. Read on! Big data streaming frameworks today must process immense amounts of data from an expanding set of disparate data sources. Beyond the requirements of fault tolerance, scalability and performance, compelling streaming engines should provide a programming model where computations can be easily expressed and deployed in a distributed manner. Akka is a natural fit to meet these requirements. Not Just Layers! What Can Pipelines and Events Do for You? Very fast Camels and Cloud Messaging. Apache Camel is a popular, mature, open-source integration library.

Very fast Camels and Cloud Messaging

It implements the Enterprise Integration Patterns which is a set of patterns that often come up when integrating distributed systems. I’ve written a lot about Camel in the past, including why I like it better than Spring Integration, how the routing engine works, how to use JMS selectors with AWS SQS, and a lot more. Camel also implements 197 connectors/adapters for talking to external systems (go to the source code, components/ directory and run this: ls -lp components/ | grep / | wc -l), github has a lot more, and you can write your own pretty trivially. This gives Camel a much broader range of connectivity options than any other integration library. Recently, I was fortunate to be able to help out a top, household-named, e-retailer with their use of Camel. The nature of this type of retail business in question is very seasonal. Big Data Processing with Apache Spark – Part 1: Introduction.