
Hadoop


Why Hadoop is hard, and how to make it easier

As my colleague Toby Wolpe wrote earlier today, Gartner released a survey of its Research Circle members showing that corporate adoption of Hadoop hasn't kept up with the hype.

First of all, to use a technical term: "no duh." For almost any new technology, there's typically a big differential between what the tech journalists and analysts are implying everybody's doing with that technology and what... everybody's actually doing with that technology. Second, while the context is that Gartner's survey found that -- to use Toby's wording -- "Just 26 percent are already deploying, piloting, or experimenting with Hadoop" (emphasis mine), I happen to think that's a very promising number. In fact, I would have guessed something a bit lower. Why?

Hadoop 101: Simplifying MapReduce Development

Learning a new development framework takes time, and, as is well known, the Hadoop platform is no exception.
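As a refresher on what that framework asks of a developer, the MapReduce contract a Hadoop job implements can be sketched in a few lines of plain Python. This is a single-machine simulation of the map, shuffle and reduce phases, not Hadoop or ScaleOut code:

```python
from collections import defaultdict

# Word count, the canonical MapReduce example, simulated locally.
# Hadoop distributes these same three phases across a cluster.

def map_phase(document):
    # Map: emit one (key, value) pair per word -> ("word", 1).
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a final result.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts["the"])  # -> 3
```

The hard part of real Hadoop development is rarely this logic; it is the cluster deployment, configuration and debugging around it, which is exactly the pain the article below describes.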

MapReduce developers face a steep learning curve when first deploying and configuring a Hadoop cluster, and again later when verifying program correctness. Compounded by long execution times (measured in minutes), this often frustrates data scientists, especially those who lack a systems-administration background. Even experienced Hadoop users run into a well-known set of "gotchas" that hinder their progress. Recognizing that these obstacles can easily create frustration and impede productivity, ScaleOut Software has focused on making it as easy as possible for developers to build high-performance, distributed applications for the enterprise.

Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse

Review: Apache Hive brings real-time queries to Hadoop

Big Data Technology Solutions and Information content from The VAR Guy

Big data company MapR is eyeing new partner opportunities, better real-time data analytics and simpler self-service solutions for business intelligence with the release this week of the newest version of its Hadoop platform, as well as the introduction of new "quick-start" Hadoop applications for the enterprise.

Both developments support the trend toward "more self-service" for companies seeking to derive value from their data, MapR CMO Jack Norris said in an interview. The newest MapR release, version 4.1, includes, among other new features, a POSIX client that makes it easier to access data over NFS, as well as a C API that enables companies to write Hadoop applications in that language. MapR 4.1 also provides a way to synchronize real-time data analytics between data centers and locations.

Cloud Hadoop First Microsoft Azure Service Running on Linux

Microsoft rolled out a version of HDInsight – its cloud Hadoop distribution – that runs on Ubuntu at this week's Strata + Hadoop World Conference in San Jose, California.

While Linux instances have been available on Azure's Infrastructure-as-a-Service cloud, this is the first Microsoft-managed, Azure-hosted service that runs on Linux. Most Hadoop deployments run on Linux, and Microsoft's move puts HDInsight in a better position to attract more developers with Hadoop experience. The company has worked on making Hadoop more Windows-friendly since at least 2012, when it first submitted a proposal to Apache to improve Hadoop for Windows Server and Azure.

Learning HBase: Shashwat Shriparv: 9781783985944: Amazon.com: Books

Big Data & Business Intelligence Books and eBooks

Social Media for Communities, Governance and Urban Daily Life

Big Data

Big Data: Hadoop in Distributed Mode

Running Hadoop On Ubuntu Linux (Single-Node Cluster) @ Michael G. Noll

In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it. This tutorial has been tested with the following software versions:

Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)
Hadoop 1.0.3, released May 2012
Sun Java 6
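The heart of a pseudo-distributed setup of this vintage (Hadoop 1.x) is a handful of properties in the conf/ directory. A minimal sketch is below; the localhost port numbers are conventional choices from tutorials of this era, not requirements:

```xml
<!-- conf/core-site.xml: tell Hadoop where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node cannot hold the default three replicas -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: the JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```

With these in place, `bin/hadoop namenode -format` initializes the HDFS filesystem and `bin/start-all.sh` brings up the daemons (commands as in Hadoop 1.x; later releases split them into `start-dfs.sh` and `start-yarn.sh`).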

Get started with Hadoop: From evaluation to your first production cluster

Hadoop is growing up.

Apache Software Foundation (ASF) Hadoop and its related projects and sub-projects are maturing as an integrated, loosely coupled stack to store, process and analyze huge volumes of varied semi-structured, unstructured and raw data. (Updated March 12, 2012.) Hadoop has come a long way in a relatively short time. Google's papers on the Google File System (GFS) and MapReduce inspired work on co-locating data storage and computational processing on individual nodes spread across a cluster. Then, in early 2006, Doug Cutting joined Yahoo and set up a 300-node research cluster there, adapting the distributed computing platform that was formerly a part of the Apache Nutch search engine project. Now, the largest production clusters are 4,000 nodes with about 15 petabytes of storage in each cluster. Since I wrote an introduction to this emerging stack in August 2011, it has become easier to install, configure and write programs to use Hadoop.

Pick a distribution 1a.