
Big Data and NOSQL


What’s Next for Apache Hadoop Data Management and Governance: Cloudera Naviga... Learn about the new functionality coming aboard Cloudera Navigator, the trail-blazing solution for metadata management and lineage in Apache Hadoop.


More than two years ago, Cloudera introduced Cloudera Navigator 1.0, the first offering to unify auditing across enterprise Apache Hadoop deployments. About a year later, Cloudera released Cloudera Navigator 2.0, which introduced another first for Hadoop: comprehensive metadata management and lineage. Today, more than 200 customers across numerous industries use Cloudera Navigator in production to deliver trust and visibility to their Hadoop deployments. Today we are announcing exciting news for Cloudera Navigator: it has joined the Cloudera Accelerator Program, a partner program designed to expedite the development and certification of partner applications. We have enlisted many of our leading data management and governance partners into this program, with even more partners to follow. Delivered in 2.0.

Lambda Architecture

NoSQL Databases and Polyglot Persistence: A Curated Guide. Big Data Right Now: Five Trendy Open Source Technologies. Big Data is on every CIO’s mind this quarter, and for good reason.


Companies will have spent $4.3 billion on Big Data technologies by the end of 2012. But here’s where it gets interesting. Those initial investments will in turn trigger a domino effect of upgrades and new initiatives that are valued at $34 billion for 2013, per Gartner. Over a five-year period, spending is estimated at $232 billion. Drill. Speed is Key: Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast.


Coordination, query planning, optimization, scheduling, and execution are all distributed throughout nodes in a system to maximize parallelization. Liberate Nested Data: Perform interactive analysis on all of your data, including nested and schema-less data. Drill supports querying against many different schema-less data sources including HBase, Cassandra and MongoDB. Flexibility: Strongly defined tiers and APIs for straightforward integration with a wide array of technologies.
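To make the idea of querying nested, schema-less records concrete, here is a minimal sketch in plain Python. It is not Drill's engine or API; the helper names `get_path` and `select` and the sample `users` records are hypothetical, chosen only to illustrate projecting and filtering fields that may be nested or missing.

```python
def get_path(record, path, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts.

    Returns `default` when any step is missing, since schema-less
    records are not guaranteed to share the same fields.
    """
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def select(records, columns, where=None):
    """Project `columns` (dotted paths) from records that pass `where`."""
    rows = []
    for rec in records:
        if where is None or where(rec):
            rows.append({col: get_path(rec, col) for col in columns})
    return rows


# Hypothetical records with differing shapes (no fixed schema).
users = [
    {"name": "ann", "address": {"city": "Oslo", "geo": {"lat": 59.9}}},
    {"name": "bob", "address": {"city": "Lima"}},
    {"name": "cy"},
]

# Roughly: SELECT name, address.city WHERE address.city IS NOT NULL
result = select(
    users,
    ["name", "address.city"],
    where=lambda r: get_path(r, "address.city") is not None,
)
print(result)
```

A real engine would push this work down to the storage layer and run it in parallel across nodes; the sketch only shows why missing and nested fields have to be handled per record.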

Disclaimer: Apache Drill is an effort undergoing incubation at The Apache Software Foundation sponsored by the Apache Incubator PMC. The Database as a Value.

Map Reduce

p1150-stonebraker. CAP theorem. XRX. NOSQL Databases. The NoSQL movement. In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology.


This description immediately felt right; I’ve never been comfortable talking about NoSQL, which, when taken literally, extends from the minimalist Berkeley DB (commercialized as Sleepycat, now owned by Oracle) to the big iron HBase, with detours into software as fundamentally different as Neo4j (a graph database) and FluidDB (which defies description). But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don’t see picketers outside Oracle’s headquarters.

Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement. Think of the last 15 years of software development. Since the ’80s, the dominant back end of business systems has been a relational database, whether Oracle, SQL Server or DB2. The sacred cows.

Spanner

Big Table. Hadoop. Marriage of Hadoop and OLAP: Best of both worlds to make sense of 200 Terabytes of data. Like many other companies in the social networking world, Zoosk inherits a vast amount of data every day from user interactions, web logs, financial transactions, as well as standard business metric data.


Making sense of the data and turning it into actionable intelligence is of utmost importance to Zoosk, where we are constantly trying to optimize our product offerings and business processes. The question is: how do we most effectively leverage our data, and turn it into business intelligence? There are a few typical approaches to answering this question. Traditionally, to gain business intelligence, one can leverage a star schema data warehouse with a multi-dimensional OLAP engine, to provide the business with a user-friendly toolset to quickly “slice and dice” data to identify trends and patterns. These toolsets can be something that users are familiar with, such as Microsoft Excel and web dashboards.
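As a rough illustration of what “slice and dice” over a star schema means, the sketch below joins a small fact table to one dimension table and aggregates a measure. Everything in it is hypothetical (the `dim_user` and `fact_sales` tables, the `rollup` helper, and the sample figures); it is not Zoosk's schema or tooling, and a real OLAP engine would precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical dimension table: user_id -> descriptive attributes.
dim_user = {
    1: {"country": "US", "platform": "web"},
    2: {"country": "US", "platform": "mobile"},
    3: {"country": "DE", "platform": "web"},
}

# Hypothetical fact table: one row per transaction, keyed to the dimension.
fact_sales = [
    {"user_id": 1, "month": "2012-01", "amount": 10.0},
    {"user_id": 2, "month": "2012-01", "amount": 5.0},
    {"user_id": 3, "month": "2012-01", "amount": 7.5},
    {"user_id": 1, "month": "2012-02", "amount": 12.0},
    {"user_id": 3, "month": "2012-02", "amount": 2.5},
]


def rollup(facts, dims, by, where=None):
    """Sum `amount` grouped by the attributes in `by`.

    `where` fixes some attributes to single values (the "slice");
    `by` picks the attributes to group on (the "dice").
    """
    where = where or {}
    totals = defaultdict(float)
    for row in facts:
        # Star-schema join: enrich the fact row with its dimension attributes.
        joined = {**row, **dims[row["user_id"]]}
        if all(joined[key] == value for key, value in where.items()):
            totals[tuple(joined[key] for key in by)] += joined["amount"]
    return dict(totals)


by_country = rollup(fact_sales, dim_user, by=["country"])
us_by_month = rollup(fact_sales, dim_user, by=["month"], where={"country": "US"})
print(by_country)
print(us_by_month)
```

The same `rollup` call answers different business questions just by changing `by` and `where`, which is the convenience an OLAP front end such as a pivot table exposes to non-technical users.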

So what do we do at Zoosk? 8 OLAP cubes, 20+ fact tables, 150+ cube dimensions. Visual Guide to NoSQL Systems - Nathan Hurst's Blog. There are so many NoSQL systems these days that it's hard to get a quick overview of the major trade-offs involved when evaluating relational and non-relational systems in non-single-server environments.


I've developed this visual primer with quite a lot of help (see credits at the end), and it's still a work in progress, so let me know if you see anything misplaced or missing, and I'll fix it. Without further ado, here's what you came here for (and further explanation after the visual). Note: RDBMSs (MySQL, Postgres, etc) are only featured here for comparison purposes. Also, some of these systems can vary their features by configuration (I use the default configuration here, but will try to delve into others later).