


HA Data Store

Using HBase Snapshots - Amazon Elastic MapReduce. HBase provides built-in snapshot functionality for creating lightweight backups of tables. In EMR clusters, these backups can be exported to Amazon S3 using EMRFS. You can create a snapshot on the master node using the HBase shell. This topic shows you how to run these commands interactively with the shell, or through a step using command-runner.jar with either the AWS CLI or the AWS SDK for Java.

For more information about other types of HBase backups, see HBase Backup in the HBase documentation.

Create a snapshot of a table from the HBase shell:

```shell
hbase snapshot create -n snapshotName -t tableName
```

Using command-runner.jar from the AWS CLI:

```shell
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
--steps Name="HBase Shell Step",Jar="command-runner.jar",\
Args=["hbase","snapshot","create","-n","snapshotName","-t","tableName"]
```

Using the AWS SDK for Java:

```java
HadoopJarStepConfig hbaseSnapshotConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("hbase", "snapshot", "create", "-n", "snapshotName", "-t", "tableName");
```
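Since the snapshot's purpose here is an S3 backup, it can help to see the two steps (create, then export) assembled programmatically. The following is a minimal sketch in plain Python of how such step definitions might be built for submission to EMR; the step names and the `s3://my-bucket/hbase-backups` destination are made-up placeholders, and the export step assumes HBase's standard `ExportSnapshot` utility.

```python
# Sketch: building EMR "command-runner.jar" step definitions in Python,
# mirroring the CLI/Java examples above. Names and the S3 bucket are
# hypothetical placeholders.

def hbase_step(name, args):
    """Build an EMR step definition that runs an HBase command via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

# Step 1: create the snapshot (same arguments as the shell example above).
create = hbase_step(
    "HBase snapshot create",
    ["hbase", "snapshot", "create", "-n", "snapshotName", "-t", "tableName"],
)

# Step 2: copy it to S3 with HBase's ExportSnapshot tool.
export = hbase_step(
    "HBase snapshot export",
    ["hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
     "-snapshot", "snapshotName", "-copy-to", "s3://my-bucket/hbase-backups"],
)

for step in (create, export):
    print(step["Name"], "->", " ".join(step["HadoopJarStep"]["Args"]))
```

Dictionaries of this shape are what an SDK ultimately submits on your behalf; with boto3, for instance, they could plausibly be passed to an add-steps call, though checking the exact request shape against the AWS documentation is advised.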

Spark Streaming with HBase. This post will help you get started using Apache Spark Streaming with HBase on the MapR Sandbox. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. This post is the fifth in a series; if you are new to Spark, read the earlier posts first. First of all, what is streaming? Typical use cases include website monitoring, network monitoring, fraud detection, web clicks, advertising, and Internet of Things sensors. Spark Streaming supports data sources such as HDFS directories, TCP sockets, Kafka, Flume, and Twitter.
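The core idea behind Spark Streaming is that a continuous stream is chopped into small batches, each of which is then processed like an ordinary Spark job. As a hedged, framework-free illustration (plain Python, not Spark's API), batching by record count rather than Spark's time intervals to keep it deterministic:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a continuous stream of records into fixed-size batches.
    Spark Streaming batches by time interval; a fixed count is used here
    only so the sketch is deterministic."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch would be handed to the normal (batch) processing engine.
events = ({"sensor": "pump-1", "psi": p} for p in [58, 59, 71, 72, 60, 65])
results = [max(e["psi"] for e in b) for b in micro_batches(events, 2)]
print(results)  # max pressure per micro-batch: [59, 72, 65]
```

The sensor readings are invented sample data; the point is only that a per-batch computation (here, a max) runs repeatedly over small finite chunks of an unbounded stream.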

How Spark Streaming Works. Streaming data is continuous and must be divided into batches before it can be processed. The post walks through the architecture of an example streaming application: example Spark Streaming code that reads the streaming data, other example Spark code that processes it, an example data set, and the HBase table schema used for the streaming data. Apache Spark Comes to Apache HBase with HBase-Spark Module - Cloudera Engineering Blog. The SparkOnHBase project in Cloudera Labs was recently merged into the Apache HBase trunk. In this post, learn about the project's history and what the future looks like for the new HBase-Spark module.
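An HBase table schema for streaming data usually comes down to two choices: a composite row key that keeps related readings adjacent and time-ordered, and cells addressed by column family and qualifier. This plain-Python sketch models those two choices; the `data` family, the qualifiers, and the key format are assumptions for illustration, not the schema from the post above.

```python
def make_row_key(sensor_id, ts):
    """Composite row key: sensor id + zero-padded epoch-millis timestamp,
    so a scan over one sensor's rows returns them in time order."""
    return f"{sensor_id}_{ts:013d}"

def make_put(sensor_id, ts, psi, flow):
    """Model an HBase Put as (row key, {'family:qualifier': value}).
    HBase stores raw bytes; strings stand in for them here."""
    return make_row_key(sensor_id, ts), {
        "data:psi": str(psi),
        "data:flow": str(flow),
    }

row, cells = make_put("pump-1", 1430000000000, 72, 13.5)
print(row, cells)
```

Zero-padding the timestamp matters because HBase sorts row keys lexicographically: without a fixed width, "999" would sort after "1000".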

SparkOnHBase was first pushed to GitHub in July 2014, just six months after Spark Summit 2013 and five months after Apache Spark first shipped in CDH. That conference was a big turning point for me, because for the first time I realized that the MapReduce engine had a very strong competitor. Spark was about to enter an exciting new phase in its open source life cycle, and just one year later it is used at massive scale at hundreds if not thousands of companies (with 200+ of them doing so on Cloudera's platform). SparkOnHBase came to be out of a simple customer request: to have a level of interaction between HBase and Spark similar to that already available between HBase and MapReduce.


Access HBase Tables with Hive - Amazon Elastic MapReduce (EMR 3.x releases). HBase and Hive are tightly integrated on Amazon EMR, allowing you to run massively parallel processing workloads directly on data stored in HBase. To use Hive with HBase, you usually launch them on the same cluster. You can, however, launch Hive and HBase on separate clusters; running them separately can improve performance because each application can then fully utilize its own cluster's resources.

The following procedure shows how to connect to HBase on a cluster using Hive. Note: you can only connect a Hive cluster to a single HBase cluster.

To connect Hive to HBase:

1. Create separate clusters with Hive and HBase installed, or create a single cluster with both HBase and Hive installed.
2. If you are using separate clusters, modify your security groups so that the HBase and Hive ports are open between the two master nodes.
3. Use SSH to connect to the master node of the cluster with Hive installed.

From there you can access HBase data from Hive.

Chapter 14. Apache HBase (TM) Operational Management. This chapter covers the operational tools and practices required of a running Apache HBase cluster.
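Once the clusters are connected, the bridge between a Hive table and an HBase table is a column-mapping property (`hbase.columns.mapping`) that pairs each Hive column with either the HBase row key (`:key`) or a `family:qualifier` cell. A minimal sketch of how such a mapping string can be interpreted; the column names used here are made up:

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair Hive columns with HBase targets from a comma-separated
    mapping string, where ':key' denotes the HBase row key."""
    targets = mapping.split(",")
    if len(targets) != len(hive_columns):
        raise ValueError("one mapping entry is required per Hive column")
    return dict(zip(hive_columns, targets))

m = parse_columns_mapping(["rowid", "psi"], ":key,data:psi")
print(m)  # {'rowid': ':key', 'psi': 'data:psi'}
```

In real use this string is supplied in the Hive table's SERDEPROPERTIES; consult the Hive HBase integration documentation for the exact DDL, as the parsing above is only a conceptual model.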

The subject of operations is related to other topics covered in this guide, but is a distinct topic in itself. Here we list HBase tools for administration, analysis, fixup, and debugging.

There is a Driver class, executed by the HBase jar, that can be used to invoke frequently accessed utilities. For example, running

```shell
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
```

will return output beginning "An example program must be given as the first argument," followed by the allowable program names.

hbck, an fsck for your HBase install: to run hbck against your HBase cluster, run $ .

At the end of the command's output it prints OK or INCONSISTENCY. For more information, see Appendix B, hbck In Depth. The main method on HLog offers manual split and dump facilities; you can get a textual dump of a WAL file's content by doing the following: $ . $ . Options: Args: tablename — the name of the table to copy.

Apache HBase™ Reference Guide.

Phoenix in 15 minutes or less | Apache Phoenix. What is this new Phoenix thing I've been hearing about? Phoenix is an open source SQL skin for HBase. You use standard JDBC APIs, instead of the regular HBase client APIs, to create tables, insert data, and query your HBase data.

Doesn't putting an extra layer between my application and HBase just slow things down? Actually, no. Phoenix gets its performance by:

- compiling your SQL queries to native HBase scans
- determining the optimal start and stop keys for your scan
- orchestrating the parallel execution of your scans
- bringing the computation to the data by pushing the predicates in your WHERE clause to a server-side filter
- executing aggregate queries through server-side hooks (called coprocessors)

In addition to these items, there are some interesting enhancements in the works to further optimize performance. OK, so it's fast. Beyond speed, the SQL layer also:

- reduces the amount of code users need to write
- allows for performance optimizations transparent to the user
- opens the door for leveraging and integrating lots of existing tooling

HBase client application best practices - Hortonworks. Apache ZooKeeper - Home.
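Returning to Phoenix's first optimization, compiling a query down to a scan with optimal start and stop keys: here is a toy sketch (plain Python, not Phoenix internals) of why that matters. Because HBase stores rows sorted by key, a predicate on the leading part of the row key can become a bounded range scan instead of a full-table filter; the sample row keys below are invented.

```python
from bisect import bisect_left

# Row keys in sorted order, as HBase stores them.
rows = sorted(["a1", "a2", "b1", "b2", "b3", "c1"])

def scan(rows, start, stop):
    """Return rows with start <= key < stop, using binary search to find
    the slice bounds -- the moral equivalent of an HBase scan's
    start and stop keys."""
    return rows[bisect_left(rows, start):bisect_left(rows, stop)]

# A predicate like WHERE key LIKE 'b%' compiles to the range ["b", "c")
# rather than a filter applied to every row in the table.
print(scan(rows, "b", "c"))  # ['b1', 'b2', 'b3']
```

The win is that work is proportional to the rows in range, not the table size, which is precisely what a well-chosen start/stop key buys a real HBase scan.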