BigTable/HBase

The battle of the Cloud Frameworks: Application Servers redux?
The battle of the Cloud Frameworks has started, and it will look a lot like the battle of the Application Servers, which played out over the last decade and a half. Cloud Frameworks (which manage IT automation and runtime outsourcing) are to the Programmable Datacenter what Application Servers are to the individual IT server. In the longer term these battlefronts may merge, but for now we've been transported back in time to the early days of Web programming. The underlying dynamic is the same: it starts with a disruptive IT event (part new technology, part new mindset). 15 years ago the disruptive event was the Web.

Today it's Cloud Computing.

Stage 1: It always starts with very simple use cases. In that sense, the IaaS APIs of today are the equivalent of the Common Gateway Interface (CGI) circa 1993/1994.
Stage 2: But the limitations soon became apparent. We haven't reached that stage for Cloud yet.
Stage 3
Stage 4

So what does it mean for Cloud Frameworks? It's early: we are at stage 1.

Google grants license for Apache Hadoop.

The Two Flavors of Google
A battle could be shaping up between the two leading software platforms for cloud computing, one proprietary and the other open-source. Why are search engines so fast?

They farm out the job to multiple processors. Each task is a team effort, some of them involving hundreds, or even thousands, of computers working in concert. As more businesses and researchers shift complex data operations to clusters of computers known as clouds, the software that orchestrates that teamwork becomes increasingly vital. The state of the art is Google's in-house computing platform, known as MapReduce.
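The division of labor described above can be illustrated with a toy sketch of the MapReduce model in Python. This is a hypothetical, sequential simulation for illustration only, not Google's or Hadoop's actual API: a map phase emits key/value pairs from each input, a shuffle step groups the pairs by key, and a reduce phase combines each group into a final result.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input document.
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

def mapreduce(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = mapreduce(["the cloud", "the cluster the cloud"])
print(counts)  # {'the': 3, 'cloud': 2, 'cluster': 1}
```

In a real deployment each map and reduce task would run on a different machine in the cluster; the structure of the computation, though, is exactly this pipeline.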

But Google (GOOG) is keeping that gem in-house. This means that the two leading software platforms for cloud computing could end up being two flavors of Google, one proprietary and the other, Hadoop, open source.

Gaining Fans
The growth of Hadoop creates a tangle of relationships in the world of megacomputing.

Wider Participation
Hadoop promises relief.

Why Hadoop Users Shouldn't Fear Google's New MapReduce Patent
Updated: Nearly six years after it first applied, Google has finally received a patent for its MapReduce parallel programming model. The question now is how this will affect the various products and projects that utilize MapReduce.

If Google is feeling litigious, every database vendor leveraging MapReduce capabilities (a list that includes Aster Data Systems, Greenplum and Teradata) could be in trouble, as could Apache's MapReduce-inspired Hadoop project. Hadoop is a critical piece of Yahoo's web infrastructure, is the basis of Cloudera's business model, and is the foundation of products like Amazon's Elastic MapReduce and IBM's M2 data-processing platform.

Fortunately for them, it seems unlikely that Google will take to the courts to enforce its new intellectual property. A big reason is that "map" and "reduce" functions have been part of parallel programming for decades, and vendors with deep pockets certainly could argue that Google didn't invent MapReduce at all.

Google's MapReduce patent: what does it mean for Hadoop?
The USPTO awarded search giant Google a software method patent that covers the principle of distributed MapReduce, a strategy for parallel processing used by the company. If Google chooses to aggressively enforce the patent, it could have significant implications for some open source software projects that use the technique, including the Apache Foundation's popular Hadoop software framework.

"Map" and "reduce" are functional programming primitives that have been used in software development for decades. A "map" operation applies a function to every item in a sequence, returning a sequence of equal size with the processed values. A "reduce" operation, also called "fold," accumulates the contents of a sequence into a single return value by applying a function that combines each item with the return value of the previous iteration. Google's MapReduce framework is roughly based on those concepts.

Lineland: HBase vs. BigTable Comparison
HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to understand and grasp. What I personally find more difficult is understanding how much of BigTable HBase covers and where differences (still) remain compared to the BigTable specification.
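The "map" and "fold" primitives described earlier can be sketched with Python's built-in `map` and `functools.reduce`. This is generic functional programming for illustration, not Google's framework:

```python
from functools import reduce

# "map": apply a function to every item in a sequence,
# yielding a new sequence of equal size.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# "reduce" (a.k.a. "fold"): combine each item with the accumulated
# result of the previous iteration, collapsing the sequence
# into a single value (here starting from 0).
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 10
```

MapReduce generalizes exactly these two steps to a cluster: the map runs in parallel across machines, and the fold becomes the per-key reduce.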

This post is an attempt to compare the two systems. Before we embark on the darker technical side of things, I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes.

Scope
The comparison in this post is based on the OSDI '06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005. Towards the end I will also address a few newer features that BigTable has nowadays and how HBase compares to those.

Terminology
There are a few different terms used in either system to describe the same thing.

Features
Atomic Read/Write/Modify: Yes, per row
Lexicographic Row Order: Yes
Block Support
Block Compression