
BigData


Running a Multi-Node Storm Cluster

In this tutorial I will describe in detail how to set up a distributed, multi-node Storm cluster on RHEL 6. We will install and configure both Storm and ZooKeeper and run their respective daemons under process supervision, similarly to how you would operate them in a production environment. I will show how to run an example topology on the newly built cluster, and conclude with an operational FAQ that answers the most common questions about managing a Storm cluster. Update Mar 2014: I have released Wirbelsturm, a Vagrant- and Puppet-based tool to perform 1-click local and remote deployments, with a focus on big data-related infrastructure such as Apache Kafka and Apache Storm.
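For orientation, here is a minimal storm.yaml sketch of the kind such a setup produces, assuming Storm 0.9.x-era configuration keys; the hostnames and the local directory are hypothetical placeholders, not values from the tutorial:

    # Hypothetical hosts; replace with your own ZooKeeper ensemble and Nimbus node.
    storm.zookeeper.servers:
      - "zk1.example.com"
      - "zk2.example.com"
      - "zk3.example.com"
    nimbus.host: "nimbus.example.com"
    # Directory where Storm daemons keep their local state.
    storm.local.dir: "/app/storm"
    # One worker process per listed port.
    supervisor.slots.ports:
      - 6700
      - 6701
      - 6702
      - 6703

Each port in supervisor.slots.ports can host one worker process, so this sketch allows up to four workers per supervisor node.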

Gathering LXC and Docker containers metrics

Linux Containers rely on control groups, which not only track groups of processes but also expose a lot of metrics about CPU, memory, and block I/O usage. We will see how to access those metrics, and how to obtain network usage metrics as well. This is relevant for “pure” LXC containers, as well as for Docker containers.

Locate your control groups

Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup. On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies. To figure out where your control groups are mounted, you can run: grep cgroup /proc/mounts

Control groups hierarchies

The fact that different control groups can be in different hierarchies means that you can use completely different groups (and policies) for, e.g., CPU and memory. Of course, if you run LXC containers, each hierarchy will have one group per container, and all hierarchies will look the same.
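To make this concrete, here is a small Scala sketch (not from the article) that discovers the cgroup mount points by parsing /proc/mounts, the programmatic equivalent of the grep above, and parses a flat key/value metrics file such as memory.stat; the Docker path in the comment is a hypothetical example:

    import scala.io.Source

    object CgroupMetrics {

      // Find where each cgroup hierarchy is mounted, like `grep cgroup /proc/mounts`.
      // Returns (mount point, mount options naming the attached subsystems).
      def cgroupMounts(): Seq[(String, String)] =
        Source.fromFile("/proc/mounts").getLines()
          .map(_.split("\\s+"))
          .collect { case f if f.length > 3 && f(2) == "cgroup" => (f(1), f(3)) }
          .toSeq

      // Parse a flat "key value" metrics file such as memory.stat.
      def statFile(path: String): Map[String, Long] =
        Source.fromFile(path).getLines()
          .map(_.split(" "))
          .collect { case Array(k, v) => k -> v.toLong }
          .toMap

      def main(args: Array[String]): Unit = {
        cgroupMounts().foreach { case (mnt, subsys) => println(s"$mnt [$subsys]") }
        // Pass the full path of a container's metrics file, e.g. (hypothetical):
        //   /sys/fs/cgroup/memory/docker/<container-id>/memory.stat
        args.headOption.foreach { path =>
          statFile(path).foreach { case (k, v) => println(s"$k = $v") }
        }
      }
    }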

Moby Hyphenation List by Grady Ward.

Hadoop Toolbox: When to Use What

Eight years ago not even Doug Cutting would have thought that the tool he named after his kid's soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonymous. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a power-packed combo. Having said that, one must be careful while choosing these tools for a specific use case, as one size doesn't fit all.

1- Hadoop: Hadoop is basically two things: a distributed file system (HDFS), which constitutes Hadoop's storage layer, and a distributed computation framework (MapReduce), which constitutes the processing layer.

2- HBase: HBase is a distributed, scalable big data store, modelled after Google's BigTable.

3- Hive: Originally developed by Facebook, Hive is basically a data warehouse.
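To make the storage/processing split concrete, here is a minimal word-count sketch in Scala against the Hadoop MapReduce API (illustrative, not from the article): the input and output paths live on HDFS, the storage layer, while the mapper and reducer run in MapReduce, the processing layer.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Mapper: emit (token, 1) for every whitespace-separated token.
    class TokenMapper extends Mapper[Object, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: Object, value: Text,
                       ctx: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
        for (tok <- value.toString.split("\\s+") if tok.nonEmpty) {
          word.set(tok)
          ctx.write(word, one)
        }
    }

    // Reducer: sum the counts emitted for each token.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))   // input on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args(1))) // output on HDFS
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }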

HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database.

Mesos Hello world in Scala.

4. YARN - Hadoop: The Definitive Guide, 4th Edition

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well. YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically used directly by user code. Instead, users write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user. The situation is illustrated in Figure 4-1, which shows some distributed computing frameworks (MapReduce, Spark, and so on) running as YARN applications on the cluster compute layer (YARN) and the cluster storage layer (HDFS and HBase).

Apache Hadoop 2.6.0 - Hadoop Map Reduce Next Generation-2.6.0 - Writing YARN Applications

Concepts and Flow

The general concept is that an application submission client submits an application to the YARN ResourceManager (RM). This can be done by setting up a YarnClient object. After YarnClient is started, the client can then set up the application context, prepare the very first container of the application that contains the ApplicationMaster (AM), and then submit the application. You need to provide information such as the details about the local files/jars that need to be available for your application to run, the actual command that needs to be executed (with the necessary command-line arguments), any OS environment settings (optional), etc.

Effectively, you need to describe the Unix process(es) that need to be launched for your ApplicationMaster. The YARN ResourceManager will then launch the ApplicationMaster (as specified) on an allocated container. During the execution of an application, the ApplicationMaster communicates with NodeManagers through an NMClientAsync object.

Simple-yarn-app/Client.java at master · hortonworks/simple-yarn-app.
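In the spirit of that Client.java, here is a hedged Scala sketch of the client-side submission flow described above; the AM command (/bin/date), application name, and resource values are hypothetical stand-ins, since a real client would launch its own ApplicationMaster class:

    import org.apache.hadoop.yarn.api.ApplicationConstants
    import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, Resource}
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.util.Records
    import scala.collection.JavaConverters._

    object SimpleYarnClient {
      def main(args: Array[String]): Unit = {
        val conf = new YarnConfiguration()
        val yarnClient = YarnClient.createYarnClient()
        yarnClient.init(conf)
        yarnClient.start()

        // Ask the ResourceManager for a new application.
        val app = yarnClient.createApplication()

        // Describe the Unix process that launches the ApplicationMaster.
        // "/bin/date" is a hypothetical stand-in command for illustration only.
        val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
        amContainer.setCommands(List(
          "/bin/date" +
            " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
            " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"
        ).asJava)

        // Resources the AM container needs (arbitrary example values).
        val capability = Records.newRecord(classOf[Resource])
        capability.setMemory(256)
        capability.setVirtualCores(1)

        // Fill in the application context and submit it to the RM.
        val appContext = app.getApplicationSubmissionContext
        appContext.setApplicationName("scala-yarn-example")
        appContext.setAMContainerSpec(amContainer)
        appContext.setResource(capability)
        appContext.setQueue("default")

        println("Submitting " + appContext.getApplicationId)
        yarnClient.submitApplication(appContext)
      }
    }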

Running scala programs on YARN

Apache YARN is Yet Another Resource Negotiator for distributed systems. It’s a distributed resource scheduler similar to Mesos. YARN was created as an effort to diversify Hadoop for different use cases, and it is available in all Hadoop 2.x releases. In this post, we are going to discuss how to run a Scala program on YARN.

Mapreduce-osdi04.

Installing the Latest CDH 5 Release

If you are installing CDH 5 on a Red Hat system, you can download Cloudera packages using yum or your web browser. If you are installing CDH 5 on a SLES system, you can download the Cloudera packages using zypper or YaST or your web browser. If you are installing CDH 5 on an Ubuntu or Debian system, you can download the Cloudera packages using apt or your web browser.

On Red Hat-compatible Systems

Use one of the following methods to add or build the CDH 5 repository or download the package on Red Hat-compatible systems. Note: Use only one of the three methods, and do this on all the systems in the cluster. To download and install the CDH 5 "1-click Install" package: click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

Hourglass: a Library for Incremental Processing on Hadoop.