
Hadoop


The Data Blog: Aster Data Blog. I’ve been working in the analytics and database market for 12 years. One of the most interesting parts of that journey has been seeing how the market is ever-shifting. Both the technology and business trends during these short 12 years have massively changed not only today’s tech landscape, but also the future evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go: everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard.”

The one golden thread that ties each of these terms, ideas and innovations together is that each aims to solve the questions related to what we today call “big data.” Mark Beyer from Gartner is credited with coining the term “logical data warehouse,” and there is an interesting story and explanation behind it.

How’d they do it? Data Lakes.

Google Megastore: The Data Engine Behind GAE. Megastore is the data engine supporting the Google App Engine.

It’s a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions. I wrote up some notes on it back in 2008 in Under the Covers of the App Engine Datastore and posted Phil Bernstein’s excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years, until this year’s CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services. My rough notes from the paper:

· Megastore is built upon Bigtable
· Bigtable supports fault-tolerant storage within a single datacenter
· Synchronous replication based upon Paxos and optimized for long-distance inter-datacenter links
· Partitioned into a vast space of small databases, each with its own replicated log
· Each log stored across a Paxos cluster

--jrh.
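To make those notes a bit more concrete, here is a minimal Python sketch of the partitioning idea they describe: data is split into small partitions (entity groups), each with its own Paxos-replicated log, so transactions within one partition are ACID while cross-partition work only gets weaker guarantees. All class and method names here are illustrative assumptions, not Megastore’s or Bigtable’s actual API.

```python
# Conceptual sketch of per-partition replicated logs, as described in the
# Megastore notes above. Names (ReplicatedLog, EntityGroup) are illustrative
# only and not part of any real Megastore/Bigtable API.

class ReplicatedLog:
    """Stand-in for a Paxos-replicated write-ahead log (one per entity group)."""
    def __init__(self):
        self.entries = []

    def append(self, mutations):
        # In Megastore this append succeeds only once a Paxos quorum of
        # replicas (possibly in different datacenters) accepts the entry.
        self.entries.append(mutations)


class EntityGroup:
    """A small partition: ACID transactions apply within a single group."""
    def __init__(self, group_id):
        self.group_id = group_id
        self.log = ReplicatedLog()
        self.rows = {}

    def commit(self, mutations):
        # A single-group transaction is serialized through the group's own log.
        self.log.append(mutations)
        self.rows.update(mutations)


# Cross-group operations share no log, so they only get weaker guarantees
# (the paper describes asynchronous queues or two-phase commit for that case).
if __name__ == "__main__":
    users = EntityGroup("user:42")
    users.commit({"name": "Ada", "photo_count": 0})
```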

Backtype: Using big data to make sense of social media. To prepare for O’Reilly’s upcoming Strata Conference, we’re talking with some of the leading innovators working with big data and analytics. Today, we talk with Backtype’s lead engineer, Nathan Marz. Backtype is an “intelligence platform,” a suite of tools and insights that help companies quantify and understand the impact of their social media efforts. Marz works on the back end, figuring out ways to store and process terabytes of data from Twitter, Facebook, YouTube, and millions of blogs. The platform runs on Hadoop and makes use of Cascading, a Java API for creating complex workflows for processing data. Marz likes working with the Java-based tool for abstracting the details of Hadoop because “I find that when you’re using a custom language you end up having a lot of complexity in your program that you don’t anticipate, especially when you try to do things that are more dynamic.” Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011).

Exhibitors: Strata 2011 - O'Reilly Conferences, February 01 - 03, 2011. Business Intelligence - The Dimensions. Hadoop.

[repost] How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data « New IT Farmer. How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battleground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace’s mail division), has generously provided a fascinating account of how they evolved their log processing system: from an early amoeba-like approach of text files stored on each machine, to a Neanderthal-like relational database solution that just couldn’t compete, and finally to a Homo sapien-like Hadoop-based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem.

Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system, logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine.
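Querying those logs with MapReduce boils down to a mapper that extracts a key from each log line and a reducer that aggregates per key. As a rough illustration of that pattern (not Mailtrust’s actual code), here is a minimal Hadoop Streaming style mapper and reducer in Python that counts log lines per sending host; the log format, the field position of the host, and the script name are assumptions made for the example.

```python
#!/usr/bin/env python
# logcount.py -- illustrative Hadoop Streaming job: count mail-log lines per host.
# The log format (host in the 4th whitespace-separated field) is an assumption,
# not Mailtrust's actual schema.
import sys


def mapper():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 4:
            host = fields[3]
            print("%s\t1" % host)


def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so equal keys arrive on consecutive lines.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current:
            count += int(value)
        else:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, int(value)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```

A job like this would be launched through the standard streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input /mail/logs -output /mail/log-counts -mapper logcount.py -reducer "logcount.py reduce" -file logcount.py` (an assumed invocation, with paths as placeholders).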

[HDFS-1432] HDFS across data centers: HighTide. Goal: The goal of the HighTideNode is to keep only one physical replica per data center. This is mostly for older files that change very infrequently. The HighTide server watches over the two HDFS namespaces from two different NameNodes in two different data centers. These two equivalent namespaces will be populated via means that are external to HighTide. The HighTide server verifies (via checksums of the crc files) that two directories in the two HDFS clusters contain identical data, and if so, reduces the replication factor to 2 on both clusters. (One or both clusters could be using HDFS-RAID too.) The HighTideNode monitors any missing replicas on both NameNodes, and if it finds any, it fixes them by copying data from the other NameNode in the remote data center. In short, the replication within an HDFS cluster will occur via the NameNode as usual.
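As a concrete (and heavily simplified) illustration of that verify-then-relax-replication loop, here is a sketch that shells out to the standard `hadoop fs -checksum`, `hadoop fs -setrep`, and `hadoop distcp` commands. This is not the actual HighTideNode implementation, and the cluster URIs and the example path are placeholders.

```python
#!/usr/bin/env python
# Illustrative sketch of the HighTide idea described above, NOT the real
# HighTideNode code: if the same file in two clusters has a matching checksum,
# lower the replication factor on both; otherwise re-copy from the healthy
# cluster. Cluster URIs and the example path are placeholders.
import subprocess

CLUSTER_A = "hdfs://dc-a-namenode:8020"
CLUSTER_B = "hdfs://dc-b-namenode:8020"


def checksum(cluster, path):
    """Return the checksum column from `hadoop fs -checksum` for one file."""
    out = subprocess.check_output(["hadoop", "fs", "-checksum", cluster + path])
    return out.split()[-1]  # last column is the checksum value


def set_replication(cluster, path, factor):
    """Lower (or raise) the replication factor and wait for it to take effect."""
    subprocess.check_call(
        ["hadoop", "fs", "-setrep", "-w", str(factor), cluster + path])


def sync_from(src_cluster, dst_cluster, path):
    """Re-copy a path from the healthy cluster with distcp."""
    subprocess.check_call(
        ["hadoop", "distcp", src_cluster + path, dst_cluster + path])


def hightide_pass(path):
    if checksum(CLUSTER_A, path) == checksum(CLUSTER_B, path):
        # Identical data exists in both data centers, so fewer physical
        # replicas are needed inside each cluster.
        set_replication(CLUSTER_A, path, 2)
        set_replication(CLUSTER_B, path, 2)
    else:
        # Mismatch (or missing replicas): heal cluster B from cluster A.
        sync_from(CLUSTER_A, CLUSTER_B, path)


if __name__ == "__main__":
    hightide_pass("/archive/2010/12/part-00000")
```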

DataNodeGateway: I envision a single HighTideNode coordinating replication between multiple HDFS clusters. HDFS-RAID: HighTide can co-exist with HDFS-RAID.

Will Facebook (or Apple) Be the Next Great Hadoop Champion? Managing Big Data: Architectural Approaches for making batch data available online · Yahoo! Hadoop Blog. Hadoop · Search Results · Yahoo! Hadoop Blog. Nvidia denver.

Dev Blog » Blog Archive » Gratuitous Hadoop: Stress Testing on the Cheap with Hadoop Streaming and EC2. Things have a funny way of working out this way. A couple of features were pushed back from a previous release and some last-minute improvements were thrown in, and suddenly we found ourselves dragging out a lot more fresh code in our release than usual. All this came the night before one of our heavy API users was launching something of their own. They were expecting to hit us thousands of times a second, and most of their calls touched some piece of code that hadn’t been tested in the wild.

Ordinarily, we would soft launch and put the system through its paces. But now we had no time for that. We really wanted to hammer the entire stack, yesterday, and so we couldn’t rely on internal compute resources. Typically, people turn to a load-testing service for this sort of thing, but for the load we wanted, they charge many hundreds of dollars. At first, I thought we should use something like JMeter from some EC2 machines, but I wanted to go to bed. So it was settled: Hadoop it is!
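The trick in the post’s approach is that a load test is a map-only job: each map task reads target URLs from its input split and hammers them, so the size of the EC2 cluster controls the total request rate and there is no reduce step at all. Here is a minimal sketch of that kind of mapper; the input format, the emitted fields, and the invocation are assumptions, not the blog’s actual job.

```python
#!/usr/bin/env python
# loadgen.py -- sketch of a load-generating mapper for Hadoop Streaming.
# Each map task reads API URLs from stdin, issues the requests, and emits
# "status<TAB>latency_ms" so the job output doubles as a crude response-time
# report. Illustrative only; not the original post's code.
import sys
import time
import urllib.request


def main():
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                status = str(resp.status)
        except Exception:
            status = "error"
        elapsed_ms = int((time.time() - start) * 1000)
        print("%s\t%d" % (status, elapsed_ms))


if __name__ == "__main__":
    main()
```

Run as a map-only streaming job (again an assumed invocation, paths as placeholders): `hadoop jar hadoop-streaming.jar -input /loadtest/urls -output /loadtest/results -mapper loadgen.py -file loadgen.py -numReduceTasks 0`. Splitting the URL file across many map slots on EC2 is what turns a cheap cluster into a distributed load generator.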

Hadoop Streaming. Designing Our Job. High Availability Hadoop.

Hadoop Makes Sense of Lots of Data — EnterpriseStorageForum.com. Cloudera and Pentaho Build on Hadoop. Mike Karp, an analyst with Ptak, Noel & Associates, cautions that any kind of open-source software is by its very nature a double-edged sword: cheap to implement, but often hard to find adequate support for, especially in the early stages of adoption.

"Most of where the support would come from, after all, is a group of volunteers; as a result, companies are often nervous about doing open source code with business-critical applications," said Karp. "The good news, of course, is that these volunteers frequently are often inspired to write great code, and there's plenty of evidence in the past that open-source projects have achieved great success. " That's where companies like Cloudera and Pentaho come in. Business Intelligence One of the primary value propositions is adding customer value. Top management, too, is beginning to realize the potential that might be sitting stored and unutilized within the enterprise.