BigData Technology

Elastic File System – Production-Ready in Three Regions

The portfolio of AWS storage products has grown increasingly rich and diverse over time.

Amazon S3 started out with a single storage class and has grown to include storage classes for regular, infrequently accessed, and archived objects. Similarly, Amazon Elastic Block Store (EBS) began with a single volume type and now offers a choice of four types of SAN-style block storage, each designed to be a great fit for a particular set of access patterns and data types. With object storage and block storage capably addressed by S3 and EBS, we turned our attention to the file system. We announced the Amazon Elastic File System (EFS) last year in order to provide multiple EC2 instances with shared, low-latency access to a fully managed file system.

I am happy to announce that EFS is now available for production use in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) Regions. EFS offers two distinct performance modes. I mounted my file system as /efs, and there it was:

- Jeff

Kimball Big Data Warehousing 101 - Business Intelligence, Analytics & Excel

As I was writing about Ralph Kimball’s excellent, “must-watch” Cloudera webinars on big data warehousing 101, I received a mass email from Kimball Group, and a notification via Melissa Coates (aka @SQLChick) on Twitter, that the entire Kimball Group is retiring in December 2015.

I am totally shocked and a bit saddened by this news. Kimball Group might just be the #1 most respected group in all of traditional data warehousing. I am an avid fan and own a library of Kimball’s Toolkit books that have been invaluable throughout my career. I do hope that they will write at least one more book covering the massive big data, cloud and hybrid data world changes for data warehousing professionals. To get a current pulse on Hadoop data warehousing design impacts, I reached out to a few product team experts at Cloudera, Hortonworks and traditional relational database vendors.

SQL and Hadoop: It's complicated

On and off, over the years, I have followed and written about the SQL-on-Hadoop saga.

The adventure started with Apache Hive, which originally provided a SQL layer on top of MapReduce, bringing new usability to Hadoop, but little utility for interactive query scenarios. Things got interesting in the fall of 2012, when Cloudera introduced the beta release of Impala, its SQL-on-Hadoop engine that bypassed MapReduce completely, providing for true interactive query over Hive-compatible data on Hadoop. A lot happened subsequent to that, but it can be pretty easily summarized as follows: (1) virtually every relational database and data warehouse vendor introduced an interactive SQL-on-Hadoop technology to query Hadoop data with its own query engine and (2) Apache Spark, a distributed memory and disk-based data framework, became a thing, and the introduction of its Spark SQL module provided a way to query Hive-compatible data using its own processing substrate.

Bossie Awards 2015: The best open source big data tools

With hundreds of contributors, Spark is one of the most active and fastest-growing Apache projects, and with heavyweights like IBM throwing their weight behind the project and major corporations bringing applications into large-scale production, the momentum shows no signs of letting up.

The sweet spot for Spark continues to be machine learning. Highlights since last year include the replacement of the SchemaRDD with a DataFrames API, similar to those found in R and Pandas, making data access much simpler than with the raw RDD interface. Also new are ML pipelines for building repeatable machine learning workflows, expanded and optimized support for various storage formats, simpler interfaces to machine learning algorithms, improvements in the display of cluster resource usage, and task tracking. On by default in Spark 1.5 is the off-heap memory manager, Tungsten, which offers much faster processing by fine-tuning data structure layout in memory.

NoSQL Databases: An Overview

Over the last few years we have seen the rise of a new type of database, known as NoSQL databases, that is challenging the dominance of relational databases.

Relational databases have dominated the software industry for a long time, providing mechanisms for persistent data storage, concurrency control, transactions, mostly standard interfaces, and ways to integrate application data and reporting. The dominance of relational databases, however, is cracking.

Das grosse BigData Workbook.

An Architect’s Guide to Big Data.

Big DATA - Advanced Analytics in Oracle Database.pdf.

Hadoop

Graph databases.