
Veille Techno


Apache Cassandra with Hector - An Example. I recently attended the Strange Loop Conference in Saint Louis. While there I indulged in two things primarily: booze with old buddies and NoSQL at the conference. In particular, I heard a lot of mentions of Apache Cassandra. Why would one care about Cassandra? How about a 150 TB cluster spanning over 150 machines at Facebook? Cassandra is used by organizations such as Digg and Twitter that deal with large amounts of data. I could attempt to write more on Cassandra, but there is a great presentation by Eric Evans on the same. If not talking about Cassandra, what am I talking about?

So I am only going to use Java; sorry, no Ruby or Scala for me right now. The model created was based on Arin's schema with a few enhancements. Gephi - Plugins. Static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012. Predicting what topics will trend on Twitter. Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume.

Predicting what topics will trend on Twitter

A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number. At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student Stanislav Nikolov will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

Let the data decide. In the standard approach to machine learning, Shah explains, researchers would posit a “model” — a general hypothesis about the shape of the pattern whose specifics need to be inferred. Keeping pace. Welcome to Apache™ Hadoop®! Facebook Corona. Apache Mahout: Scalable machine learning and data mining. Welcome to Apache Pig! Welcome to Hive! Hadoop, Big Data, and Enterprise Business Intelligence. Many thanks to William Gardella and others for the content below: This post is an attempt to summarize the current state of play with Hadoop, “Big Data” and Enterprise BI, and what it means to existing users of enterprise business intelligence.

Hadoop, Big Data, and Enterprise Business Intelligence

See the list of articles at the end of the post for more detailed materials. What is Hadoop? Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
Distributed: Handles replication.

Hadoop is designed to process terabytes and even petabytes of unstructured and structured data.

HDFS: Hadoop Distributed File System.
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable.

Image: William Gardella. Are Companies Adopting Hadoop?

Yes. Www.christof-strauch.de/nosqldbs.pdf. NoSQL databases comparison. Translation SQL to MapReduce. Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison. (Yes it's a long title, since people kept asking me to write about this and that too :) I do when it has a point.)

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

While SQL databases are insanely useful tools, their monopoly of the last decades is coming to an end. And it's about time: I can't even count the things that were forced into relational databases but never really fit them. (That being said, relational databases will always be the best for the stuff that has relations.) But the differences between NoSQL databases are much bigger than they ever were between one SQL database and another.

This means that software architects bear a bigger responsibility to choose the appropriate one for a project right at the beginning. Survey distributed databases - Toad for Cloud Wiki. Overview: This document, researched and authored by Quest's chief software architect Randy Guck, provides a summary of distributed databases. MongoDB. Mongo db – document oriented database. Hypertable. Apache CouchDB. Document-Oriented Databases: Couchdb Primer. HBase. Facebook Messages & HBase. The Apache Cassandra Project. Indexing in Cassandra.

Scaling Twitter with Cassandra. Netflix on Cassandra. Rainbird: Realtime Analytics at Twitter (Strata 2011). DataModel. Cassandra is a partitioned row store, where rows are organized into tables with a required primary key.

DataModel

The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the PK. Other columns may be indexed independent of the PK. This allows pervasive denormalization to "pre-build" resultsets at update time, rather than doing expensive joins across the cluster. DataStax has a good introduction to data modeling in Cassandra here. For more detail, see Patrick McFadin's data modeling series: DataStax Cassandra 1.1 Documentation.
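
The partition-key and clustering behavior described in the DataModel excerpt can be illustrated with a small in-memory sketch. This is only a toy model in Python, not Cassandra's API or storage engine: the class `PartitionedTable`, the `timeline` table, the `followers` map, and the `post_tweet` helper are all hypothetical names invented for this example. It shows rows grouped by the first primary-key component, ordered within a partition by the remaining primary-key columns, and a result set "pre-built" at write time instead of joined at read time.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a partitioned row store (hypothetical, not Cassandra's API)."""

    def __init__(self, partition_key, clustering_keys):
        self.partition_key = partition_key
        self.clustering_keys = tuple(clustering_keys)
        # partition key value -> {clustering tuple -> row}
        self._partitions = defaultdict(dict)

    def insert(self, row):
        """Upsert a row; rows sharing a full primary key overwrite each other."""
        pk = row[self.partition_key]
        ck = tuple(row[c] for c in self.clustering_keys)
        self._partitions[pk][ck] = dict(row)

    def read_partition(self, pk_value):
        """Return one partition's rows, ordered by the clustering columns."""
        partition = self._partitions.get(pk_value, {})
        return [partition[ck] for ck in sorted(partition)]


# Denormalization: one timeline partition per reader, filled at update time.
timeline = PartitionedTable("user", ["posted_at"])
followers = {"alice": ["bob", "carol"]}  # assumed sample data


def post_tweet(author, posted_at, body):
    """Copy the tweet into every follower's partition instead of joining later."""
    for follower in followers.get(author, []):
        timeline.insert(
            {"user": follower, "posted_at": posted_at, "author": author, "body": body}
        )


post_tweet("alice", 2, "second tweet")
post_tweet("alice", 1, "first tweet")

for row in timeline.read_partition("bob"):
    print(row["posted_at"], row["author"], row["body"])
```

Reading bob's timeline touches a single partition and returns rows already in clustering order, which is the trade the excerpt describes: more work and duplicated data at write time in exchange for reads that need no cross-cluster join.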