background preloader

Etsy/statsd

Ganglia Monitoring System

Samza

Hosted Graphite – Graphite as a service, with StatsD and Grafana dashboards

Start page – collectd – The system statistics collection daemon

Analyzing the Analyzers

Large-scale Incremental Processing Using Distributed Transactions and Notifications
Abstract: Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index.
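The "small, independent mutations" idea in the Percolator abstract can be illustrated with a toy observer model. This is a hedged sketch, not Percolator's actual API (which involves Bigtable, snapshot-isolation transactions, and notification columns): here a write to a watched column simply triggers a callback that performs a further derived write, so updates propagate incrementally instead of via a batch recompute.

```python
# Toy sketch of observer-driven incremental processing (NOT the real
# Percolator API): a write to a watched column triggers an observer,
# which may perform further writes, cascading small mutations through
# the repository without rebuilding it from scratch.

class Table:
    def __init__(self):
        self.rows = {}       # (row, column) -> value
        self.observers = {}  # column -> list of callbacks

    def observe(self, column, callback):
        # Register an observer to run whenever `column` is written.
        self.observers.setdefault(column, []).append(callback)

    def write(self, row, column, value):
        self.rows[(row, column)] = value
        for cb in self.observers.get(column, []):
            cb(self, row, value)

def index_document(table, row, raw_text):
    # Hypothetical observer: derive an "indexed" form of newly crawled text.
    table.write(row, "indexed", raw_text.lower().split())

table = Table()
table.observe("raw", index_document)
table.write("example.com", "raw", "Hello Percolator World")
print(table.rows[("example.com", "indexed")])  # ['hello', 'percolator', 'world']
```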

mozilla/crontabber

Welcome to Apache Flume — Apache Flume

Mining Time-series with Trillions of Points: Dynamic Time Warping at scale
Take a similarity measure that's already well known to researchers who work with time series, and devise an algorithm to compute it efficiently at scale. Suddenly intractable problems become tractable, and Big Data mining applications that use the metric are within reach. Classification, clustering, and search over time series have important applications in many domains: in medicine, EEG and ECG readings translate to time-series collections with billions (even trillions) of points. The problem is that existing algorithms don't scale to sequences with hundreds of billions or trillions of points. Recently a team of researchers led by Eamonn Keogh of UC Riverside introduced a set of tools for mining time-series with trillions of points. What is Dynamic Time Warping? The baseline is Euclidean distance, ED(x, y) = sqrt(Σ_i (x_i - y_i)^2). While ED is easy to define, it performs poorly as a similarity score because it compares points strictly index by index. DTW instead allows points to align out of phase: there are an exponential number of warping paths (from one time series to the other) through the warping matrix, and DTW picks the minimum-cost one.
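The minimum-cost warping path mentioned above is found by dynamic programming rather than by enumerating the exponentially many paths. A minimal sketch, for intuition only: `cost[i][j]` is the best alignment cost of the first `i` points of `x` against the first `j` points of `y` (real trillion-point mining, as in the Keogh group's work, layers lower bounds and pruning on top of this O(n·m) core).

```python
# Minimal Dynamic Time Warping by dynamic programming, O(n*m) time.
import math

def dtw_distance(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of: diagonal match, or a warp step
            # that repeats a point of one series against the other.
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return math.sqrt(cost[n][m])

def euclidean(x, y):
    # The index-by-index baseline the article contrasts DTW with.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# b is a phase-shifted copy of a: ED penalizes the shift, DTW warps past it.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(euclidean(a, b), dtw_distance(a, b))  # 2.0 0.0
```

The example shows why DTW is the better similarity score here: the two signals are identical up to a one-step shift, which ED punishes and DTW does not.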

Distributed stream processing showdown: S4 vs Storm | Kenkyuu
S4 and Storm are two distributed, scalable platforms for processing continuous unbounded streams of data. I have been involved in the development of S4 (I designed the fault-recovery module) and I have used Storm for my latest project, so I have gained some experience with both and want to share my views on these two very similar, competing platforms. First, some commonalities: both are distributed stream processing platforms, run on the JVM (S4 is pure Java while Storm is part Java, part Clojure), are open source (Apache and Eclipse licenses), are inspired by MapReduce, and are quite new. Now for some differences. Programming model: S4 implements the Actors paradigm, while Storm does not impose an explicit programming paradigm. To make things clearer, let's use the classic "hello world" program from MapReduce, word count, in its streaming form. In synthesis, in S4 you program for a single key; in Storm you program for the whole stream. Data pipeline.
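The single-key vs whole-stream contrast can be made concrete with a toy streaming word count. This is a hedged sketch, not the real S4 or Storm APIs: in the S4-style actor you write logic for one key and the platform (simulated here) routes each key to its own instance, while the Storm-style bolt receives every tuple and manages the per-key state itself.

```python
# Toy contrast (NOT the real S4/Storm APIs) for streaming word count.
from collections import defaultdict

class S4StyleCounter:
    """S4 style: one actor instance per distinct word; it sees only its key."""
    def __init__(self):
        self.count = 0
    def on_event(self):
        self.count += 1

class StormStyleBolt:
    """Storm style: one bolt sees the whole stream and keeps all counts."""
    def __init__(self):
        self.counts = defaultdict(int)
    def execute(self, word):
        self.counts[word] += 1

stream = "to be or not to be".split()

# S4 style: the platform would instantiate and route; we simulate it.
actors = {}
for word in stream:
    actors.setdefault(word, S4StyleCounter()).on_event()

# Storm style: the single bolt handles state for every key itself.
bolt = StormStyleBolt()
for word in stream:
    bolt.execute(word)

print(actors["to"].count, bolt.counts["to"])  # 2 2
```

Both arrive at the same counts; the difference is where the per-key bookkeeping lives, which is exactly the programming-model gap the post describes.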

Using monitoring and metrics to learn in development

etsy/oculus

Munin
