
S4: Distributed Stream Computing Platform

Distributed stream processing showdown: S4 vs Storm

S4 and Storm are two distributed, scalable platforms for processing continuous, unbounded streams of data. I was involved in the development of S4 (I designed the fault-recovery module) and I have used Storm for my latest project, so I have gained some experience with both and want to share my views on these two very similar, competing platforms.

First, some commonalities. Both are distributed stream processing platforms, run on the JVM (S4 is pure Java while Storm is part Java, part Clojure), are open source (Apache/Eclipse licenses), are inspired by MapReduce, and are quite new.

Now for some differences.

Programming model. S4 implements the Actors programming paradigm; Storm does not impose an explicit one. To make this concrete, let's use the classic "hello world" program from MapReduce: word count. Say we want to implement a streaming word count (see the sketch below). In short, in S4 you program for a single key, while in Storm you program for the whole stream.

Data pipeline.
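To make the contrast concrete, here is a minimal plain-Java sketch of the two styles. This is not the actual S4 or Storm API; class and method names such as WordCountPE, onEvent, and onTuple are hypothetical, and the "platform" routing is simulated with a plain map.

```java
import java.util.HashMap;
import java.util.Map;

// S4 style: the platform routes each key to its own Processing Element
// instance, so a PE only ever sees one word and keeps a single counter.
class WordCountPE {
    private final String word; // the key this instance is bound to
    private long count = 0;

    WordCountPE(String word) { this.word = word; }

    void onEvent() {           // called once per occurrence of `word`
        count++;
        System.out.println(word + " -> " + count);
    }
}

// Storm style: one bolt instance receives the whole (partitioned)
// stream and must manage the per-key state itself.
class WordCountBolt {
    private final Map<String, Long> counts = new HashMap<>();

    void onTuple(String word) { // called for every word in the stream
        long c = counts.merge(word, 1L, Long::sum);
        System.out.println(word + " -> " + c);
    }
}

public class StreamingWordCount {
    public static void main(String[] args) {
        // Simulate the stream: S4 would create one PE per distinct key...
        Map<String, WordCountPE> pes = new HashMap<>();
        // ...while Storm would deliver every tuple to the same bolt.
        WordCountBolt bolt = new WordCountBolt();
        for (String w : new String[] {"hello", "world", "hello"}) {
            pes.computeIfAbsent(w, WordCountPE::new).onEvent();
            bolt.onTuple(w);
        }
    }
}
```

The design difference follows: S4 spreads state across many small per-key objects and the framework handles routing, while Storm hands you the raw stream and leaves key partitioning and state management to your topology.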

Large-scale Incremental Processing Using Distributed Transactions and Notifications

Abstract: Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index.
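For a sense of the notification mechanism the title refers to, here is a minimal single-threaded Java sketch, not the Percolator API (which is internal to Google): all names (Table, Observer, registerObserver, runObservers) are hypothetical, and the atomic write-plus-notify that Percolator gets from distributed transactions is reduced to sequential calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the notification idea: writing a watched column queues a
// notification, and registered observers later run on the changed row,
// possibly writing further columns and cascading incremental work.
interface Observer {
    void onNotify(Table table, String row);
}

class Table {
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    private final Map<String, List<Observer>> observers = new HashMap<>();
    private final List<String[]> pending = new ArrayList<>(); // (row, column)

    void registerObserver(String column, Observer o) {
        observers.computeIfAbsent(column, k -> new ArrayList<>()).add(o);
    }

    String get(String row, String column) {
        return rows.getOrDefault(row, Map.of()).get(column);
    }

    // Percolator makes the write and the notification atomic via a
    // distributed transaction; here they are simply sequential.
    void set(String row, String column, String value) {
        rows.computeIfAbsent(row, k -> new HashMap<>()).put(column, value);
        if (observers.containsKey(column)) pending.add(new String[] {row, column});
    }

    // Drain notifications, triggering observers. The real system runs
    // observers continuously across many workers.
    void runObservers() {
        while (!pending.isEmpty()) {
            String[] n = pending.remove(0);
            for (Observer o : observers.get(n[1])) o.onNotify(this, n[0]);
        }
    }
}

public class PercolatorSketch {
    public static void main(String[] args) {
        Table t = new Table();
        // Observer: when a raw document arrives, derive an "indexed" column.
        t.registerObserver("raw", (table, row) ->
            table.set(row, "indexed", table.get(row, "raw").toLowerCase()));
        t.set("doc1", "raw", "Hello Percolator");
        t.runObservers();
        System.out.println(t.get("doc1", "indexed")); // hello percolator
    }
}
```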

Samza

etsy/statsd
