
HBase


LINE Storage: Storing billions of rows in Sharded-Redis and HBase per Month. Hi, I’m Shunsuke Nakamura (@sunsuk7tp). Just half a year ago, I completed the Computer Science Master’s program at Tokyo Tech and joined NHN Japan as a member of the LINE server team. My ambition is to hack distributed processing and storage systems and develop the next generation’s architecture. In the LINE server team, I’m in charge of developing and operating the advanced storage system that manages LINE’s messages, contacts, and groups. Today, I’ll briefly introduce the LINE storage stack.

Beginning with Redis [2011.6 ~]
In the beginning, we adopted Redis as LINE’s primary storage. The larger the service grew, the more nodes were needed, and client-side sharding prevented us from scaling effectively.

To manage sharding on the server side instead, we built a shard manager for Redis. This manager has the following characteristics:

- Sharding management by ZooKeeper (consistent hashing, compatible with other algorithms)
- Failure detection and automatic/manual failover between master and slave
- Scaling out with minimal downtime (< 10 sec)

For data scalability, we rely on HBase and HDFS.
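Consistent hashing is the core of that sharding scheme: each node owns an arc of a hash ring, so adding or removing a node only remaps the keys on the neighboring arc. Below is a minimal, hypothetical sketch of ring-based shard lookup in Java; the virtual-node count, hash choice, and class names are illustrative assumptions, not LINE's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: a consistent-hash ring mapping keys to Redis shards.
// Virtual nodes smooth out the key distribution as shards come and go.
public class ShardRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // assumption, tune per cluster

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Walk clockwise to the first virtual node at or after the key's hash;
    // only keys between a removed node and its successor ever move.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL); // fold first 8 bytes
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```

In a setup like the one described above, ZooKeeper would hold the authoritative ring, so every client and the failover logic see the same key-to-shard mapping.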

Xfs

Peer-to-peer keyword searching.

Practical scalable b-tree.

Hypercubes in HBase « MyNoSQL.

Lineland: HBase Architecture 101 - Storage. One of the more hidden aspects of HBase is how data is actually stored. While most users will never need to bother with it, you may have to get up to speed when you want to understand what the various advanced configuration options at your disposal actually mean. "How can I tune HBase to my needs?" and similar questions certainly become interesting once you get over the (at times steep) learning curve of setting up a basic system. Another reason for wanting to know more is if, for whatever reason, disaster strikes and you have to recover an HBase installation. In my own efforts to get to know the respective classes that handle the various files, I started to sketch a picture in my head illustrating the storage architecture of HBase. But while the ingenious and blessed committers of HBase easily navigate back and forth through that maze, I find it much more difficult to keep a coherent image.

So what does my sketch of the HBase innards really say? You can start by listing the files HBase keeps in HDFS:

$ hadoop dfs -lsr /hbase/docs

HBase datacenter replication. HBase should consider supporting a federated deployment where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions. It would be sweet if HBase had something on the roadmap to sync between replicas out of the box. Consider if rows, columns, or even cells could be scoped: local, or global. Then consider a background task on each cluster that replicates new globally scoped edits to peer clusters. The HBase/BigTable data model has convenient features (timestamps, multiversioning) such that simple exchange of globally scoped cells would be conflict-free and would "just work".
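This cell-scoping idea is essentially how HBase's cluster replication later materialized: each column family carries a replication scope, where 0 means local-only and 1 means its edits are shipped to peer clusters by a background process. Below is a hedged sketch using the classic (pre-1.0) Java admin API; the table and family names are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch (classic pre-1.0 API): per-column-family replication scope.
// Scope 0 = local only; scope 1 = globally scoped, edits are shipped to peers.
public class GlobalScopeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor(Bytes.toBytes("messages")); // hypothetical table
        HColumnDescriptor family = new HColumnDescriptor("d");                    // hypothetical family
        family.setScope(1); // globally scoped: background replication ships these edits
        table.addFamily(family);

        admin.createTable(table);
        admin.close();
    }
}
```

Because every cell carries a timestamp and tables keep multiple versions, peers can apply shipped edits in any order and still converge on the same state, which is exactly the conflict-free property noted above.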

Implementation effort here would be in producing an efficient mechanism for collecting edits from all the HRS (HBase region servers) and transmitting them over the network to peers, where they would then be split out to the region servers there. Holding on to the edit trace and tracking it until the remote commits succeed would also be necessary.

HBase coprocessors. From Google's Jeff Dean, in a keynote to LADIS 2009 (slides 66-67):

BigTable Coprocessors (New Since OSDI'06)
- Arbitrary code that runs next to each tablet in a table
- As tablets split and move, coprocessor code automatically splits/moves too
- High-level call interface for clients: unlike RPC, calls are addressed to rows or ranges of rows, and the coprocessor client library resolves them to actual locations; calls across multiple rows are automatically split into multiple parallelized RPCs
- Very flexible model for building distributed services: automatic scaling, load balancing, request routing for apps

Example Coprocessor Uses
- Scalable metadata management for Colossus (next-gen GFS-like file system)
- Distributed language model serving for machine translation system
- Distributed query processing for full-text indexing support
- Regular expression search support for code repository
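HBase did grow a coprocessor framework along these lines starting with version 0.92. As a hedged sketch against that era's API: a RegionObserver is loaded into every region of a table and, like BigTable coprocessors, moves with the regions as they split and get reassigned. The class below simply counts puts; its name and behavior are invented for illustration:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Hedged sketch (HBase 0.92-era API): an observer coprocessor that runs
// inside the region server, next to the data, around client operations.
public class PutCountingObserver extends BaseRegionObserver {
    private final AtomicLong putCount = new AtomicLong();

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, boolean writeToWAL)
            throws IOException {
        // Invoked on the region server hosting the row, before the Put lands.
        putCount.incrementAndGet();
    }
}
```

Endpoint coprocessors cover the other half of Dean's description: clients address calls to rows or row ranges, and the framework fans them out as parallel RPCs to the regions holding those rows.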

Lineland: HBase vs. BigTable Comparison. HBase is an open-source implementation of the Google BigTable architecture. That much is fairly easy to understand and grasp. What I personally find more difficult is understanding how much of BigTable HBase covers and where differences (still) remain between the two. This post is an attempt to compare the two systems.

Before we embark on the dark technology side of things, I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of HBase 0.20, which uses ZooKeeper as its distributed coordination (lock) service, it has all the means to be a nearly exact implementation of BigTable's functionality.

Scope
The comparison in this post is based on the OSDI '06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005.

Terminology
There are a few different terms used in either system to describe the same thing.

Features
Atomic read/write/modify: yes, per row.
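To make "yes, per row" concrete: both systems guarantee atomicity within a single row only, never across rows. Below is a hedged sketch of an atomic per-row check-and-put using the classic HBase client API; the table, family, and qualifier names are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch (classic API): per-row atomic check-and-put. The compare and
// the write happen atomically on the region server hosting the row, so no
// other writer can interleave between the check and the update.
public class RowAtomicityExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users"); // hypothetical table

        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("state"), Bytes.toBytes("active"));

        // Applied only if info:state still holds "new" for row-1.
        boolean applied = table.checkAndPut(
                Bytes.toBytes("row-1"),
                Bytes.toBytes("info"), Bytes.toBytes("state"),
                Bytes.toBytes("new"), put);
        System.out.println("applied = " + applied);

        table.close();
    }
}
```

Anything spanning two or more rows, in either system, has to be composed from such single-row primitives by the application.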