Hadoop

BYOH – Hadoop's a Platform. Get Used To It.

When is a technology a platform? Arguably, when people build products assuming it will be there, extend their existing products to support it, or add versions designed to run on it. Hadoop is there; the age of Bring Your Own Hadoop (BYOH) is clearly upon us. Specific support for components such as Pig and Hive varies, as do capabilities and levels of partnership in development, integration and co-marketing. Some vendors appear in many categories – for example, Pentaho and IBM, at opposite ends of the size spectrum, interact with Hadoop in development tools, data integration, BI, and other ways. Analytic Platforms: Kognitio – an analytics-focused in-memory database with SQL:2011 support – offers significant Hadoop support.

Application Performance Management: longtime stalwart Compuware is bringing its portfolio to both Hadoop and leading NoSQL offerings. Database: some vendors, like IBM and Teradata, offer their own distributions, and even appliances.

Stack Up Hadoop to Find Its Place in Your Architecture.

2013 promises to be a banner year for Apache Hadoop, platform providers, related technologies – and analysts who try to sort it all out. I've been wrestling with ways to make sense of it for Gartner clients bewildered by a new set of choices, and for them and myself I've built a stack diagram that describes possible functional layers of a Hadoop-based model.

The model is not exhaustive, and it continually evolves. In my own continuous collection and updating of market, usage and technical data, it serves as a scratchpad – every project/product name in the layers is a link to a separate slide in a large deck I use to follow developments. As you can see below, it contains many Apache and non-Apache pieces – projects, products, vendors – open and closed source. (Updated 1/30/13.) I expect further refinement of this stack in the weeks ahead, and more offerings at each layer as it evolves.

Introduction of HBase Architecture.

Big data, fast: Avoiding Hadoop performance bottlenecks.

Hadoop shows a lot of promise as a relatively inexpensive landing place for the streams of big data coursing through organizations. The open source technology provides a distributed framework, built around highly scalable clusters of commodity servers, for processing, storing and managing the data that fuels advanced analytics applications. But there's no such thing as a free lunch: in production use, achieving high levels of Hadoop performance can be a challenge.

Despite all the attention it's getting, Hadoop is still a relatively young technology -- it only reached Version 1.0 status in December 2011. As a result, much of the work being done with Hadoop by users remains somewhat experimental in nature, especially outside of the large Internet companies that helped to create it and that are replete with Java programmers and systems administrators versed in deploying the technology.

Hadoop users "are dealing with an extremely deep protocol stack." – Dominique Heger, CTO, DHTechnologies

Knitting together a performance booster.

Introducing Apache Hadoop: The Modern Data Operating System.

Handling the hoopla: When to use Hadoop, and when not to.

In the past few years, Hadoop has earned a lofty reputation as the go-to big data analytics engine. To many, it's synonymous with big data technology. But the open source distributed processing framework isn't the right answer to every big data problem, and companies looking to deploy it need to carefully evaluate when to use Hadoop -- and when to turn to something else. "There's so much hype around [Hadoop] now that people think it does pretty much anything." – Kelly Stirman, director of product marketing, 10gen Inc.

For example, Hadoop has ample power for processing large amounts of unstructured or semi-structured data, but it isn't known for its speed in dealing with smaller data sets. That has limited its application at Metamarkets Group Inc., a San Francisco-based provider of real-time marketing analytics services for online advertisers. Metamarkets CEO Michael Driscoll said the company uses Hadoop for large, distributed data processing tasks where time isn't a constraint.
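To make that batch-oriented workload concrete, here is a minimal sketch of the kind of job Hadoop is built for: a word count written for Hadoop Streaming, with one script acting as both mapper and reducer. The script and the invocation are illustrative assumptions, not something described in the article.

    #!/usr/bin/env python
    # wordcount_streaming.py -- minimal Hadoop Streaming word count (sketch).
    # Hadoop Streaming pipes input splits to the mapper on stdin and sorts
    # the mapper's output by key before handing it to the reducer, so the
    # reducer sees all counts for a given word consecutively.
    import sys
    from itertools import groupby

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer():
        # Input arrives sorted by word; sum the counts per word.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print("%s\t%d" % (word, sum(int(n) for _, n in group)))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

A run would submit the script to the cluster along the lines of hadoop jar hadoop-streaming.jar -input /raw/logs -output /counts -mapper "wordcount_streaming.py map" -reducer "wordcount_streaming.py reduce" -file wordcount_streaming.py (paths hypothetical). The point is the shape of the workload: one full scan over a large input with no interactivity -- exactly the "time isn't a constraint" case Driscoll describes.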

Big data architecture adds integration options -- and tool needs.

Big data technologies open up new options for storing and managing data -- potentially in concert with data warehouse systems, not as an alternative to them. That in turn creates new data integration opportunities, which might require additional tools to effectively support a big data architecture. Big data systems make it more feasible to store data "in a very crude fashion and refine it as needed" for particular uses, said Shawn Rogers, who heads business intelligence (BI) and data warehousing research at Enterprise Management Associates Inc. in Boulder, Colo. Rogers said Hadoop systems and NoSQL databases can serve as "a sort of loading dock" for raw data, with data models and schemas being applied to data sets "much later in the game than they used to be." In such scenarios, data integration morphs from conventional extract, transform and load (ETL) processes into more malleable extract, load and transform (ELT) approaches.
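As a toy illustration of that schema-on-read pattern (plain Python rather than an actual Hadoop cluster, with hypothetical file and field names), raw records are landed untouched and a structure is imposed only when a particular question is asked:

    # elt_schema_on_read.py -- toy sketch of the ELT / "loading dock" idea.
    # Raw events are written exactly as they arrive; a schema is applied
    # only at read time, so different consumers can project different views.
    import json

    RAW_FILE = "events.jsonl"  # hypothetical landing file, one JSON record per line

    def land(records):
        # Load step: persist records with no upfront modeling or cleansing.
        with open(RAW_FILE, "a") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    def read_clicks():
        # Transform-on-read: pick out only the fields this consumer needs,
        # skipping records that don't fit its expectations.
        with open(RAW_FILE) as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("type") == "click":
                    yield {"user": rec.get("user_id"), "page": rec.get("url")}

    if __name__ == "__main__":
        land([{"type": "click", "user_id": 42, "url": "/home"},
              {"type": "view", "user_id": 7}])
        print(list(read_clicks()))

The same raw file could later be read with a different, richer schema without re-landing the data -- the "refine it as needed" step Rogers describes.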

Too much going on in a big data architecture.

GraphChi: How a Mac Mini Outperformed a 1,636 Node Hadoop Cluster - Christian Prokopp.

Lucene - Apache Solr.

Impala and Shark Benchmark | Sigmoid Analytics.

We created an Impala and Shark cluster on Amazon EC2 m3.2xlarge machines with 30 GB of RAM, loaded a 50 GB dataset into the system, and ran queries on top of it to benchmark the performance of the two systems. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.

Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
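For a sense of what low-latency SQL on data in place looks like from a client, here is a sketch using the impyla Python client. The host, port, and table/column names are assumptions, and it presumes impyla is installed and pointed at a running impalad.

    # impala_query.py -- issuing an interactive SQL query to Impala (sketch).
    # Assumes "pip install impyla" and a reachable impalad; the host, port
    # and table/column names below are illustrative, not from the article.
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)  # Impala's default query port
    cur = conn.cursor()

    # The query runs directly against data already sitting in HDFS or HBase --
    # no extract step and no data movement beforehand.
    cur.execute("""
        SELECT url, COUNT(*) AS hits
        FROM web_logs
        GROUP BY url
        ORDER BY hits DESC
        LIMIT 10
    """)
    for url, hits in cur.fetchall():
        print(url, hits)

    cur.close()
    conn.close()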

5 Reasons Hadoop Is Kicking Can and Taking Names.

Splunk Seeks Non-Geeks with Platform Upgrade, Gives Away Storm to Developers.

In full swing this week, Splunk's fourth annual .conf gathering in Las Vegas kicked off this morning with two noteworthy announcements from the data management vendor. The first is a refresh of its Enterprise suite, making version 6 of the platform available today. The second is a complete cloud version of the Enterprise suite, delivering a fully virtualized offering to appease the SaaS crowd. Splunk Chairman and CEO Godfrey Sullivan will give the first public demonstration of Splunk Enterprise 6 during his keynote session at .conf2013.

Democratizing Data

It seems Splunk is seizing an opportunity to democratize data, approaching the business intelligence sector with enterprise-grade, cloud-ready analytics.

If you ask Splunk, its new Enterprise release is "bridging the data divide" with more speed and power to unlock data's value and grant access to the people within an organization who need it. (See also: Your complete guide to Splunk .conf2013.) So what's new in version 6?

Bigger than IT

BMC Launches Big Data Management Solution for Hadoop.

BMC launches a big data management solution for Hadoop, Dataguise raises $13 million for expansion, and IBM acquires big data analytics company The Now Factory.

BMC launches big data management solution for Hadoop

BMC announced the availability of BMC Control-M for Hadoop, a new big data management solution that dramatically reduces batch processing times for extremely large collections of data sets, simplifying and automating Hadoop batch processing and connected enterprise workflows.

The new solution is a purpose-built version of the company's Control-M workload automation offering. BMC Software designed Control-M for Hadoop specifically to improve service delivery by detecting slowdowns and failures through predictive analytics and intelligent monitoring of Hadoop application workflows. "BMC Control-M provides MetaScale with a control point that allows us to integrate all those big spokes to our hub, which is big data."

Dataguise closes $13 million Series B funding.