Scalability

> > > > > > > >

Real-time

Partition. Journal of Eivind Uggedal: NoSQL East 2009 - Summary of Day 2. An entry from 2009-10-30 in the Journal. Continuing from yesterday's talks here are my summary of the second and last day of NoSQL East. Pig--Kevin Weil--Twitter Data is getting big. NYSE produces 1TB of data every day, Facebook produces 20TB+ of compressed data each day, and CERN produces 40TB each day (15PB each year). This creates a need for multiple machines and horizontal scalability. Hadoop is two things: a distributed file system and a map/reduce framework for parallel computation. Map/reduce at twitter: how many tweets per user, given tweets table?

Input: key=row, value=tweet_info.Map: output key=user_id, value=1.Shuffle: sort by user_id (so that one can use more than one reducer).Reduce: for each user_id, sum.Output: user_id, tweet count. With 2x machines this job runs just about 2x faster. The problem with Hadoop is that analysis is typically written in Java. Pig is a high level language that can be easily read. How many request do we serve each day.What is the average latency? Wackamole: use your resources. Blog Archive » Why Facebook Will Die.