BIG Data Analytics Pipeline

"Big Data Analytics" has recently been one of the hottest buzzwords. It is a combination of "Big Data" and "Deep Analysis". The former is a phenomenon of Web 2.0, where a lot of transaction and user activity data has been collected and can be mined to extract useful information.

Big Data Camp
People working in this camp typically come from a Hadoop, Pig/Hive background. From my personal experience, most people working in big data come from a computer science and distributed parallel processing background, not from a statistical or mathematical discipline.

Deep Analysis Camp
On the other hand, people working in this camp usually come from a statistical and mathematical background, where the first thing taught is how to use sampling to understand the characteristics of a large population.

Typical Data Processing Pipeline
Learning from my previous projects, I observe that most data processing pipelines fall into the following pattern: Big Data + Deep Analysis.
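As a rough illustration of that pattern, here is a minimal sketch of such a pipeline in Python: collect raw events, store them in bulk, sample down to a workable size, then fit a simple statistical model. All function names and the toy click data are hypothetical, not taken from the original post.

```python
# A minimal sketch of the "Big Data + Deep Analysis" pipeline pattern.
# The stages and names below are illustrative assumptions.
import random


def collect_events(n):
    """Stand-in for the Web 2.0 firehose: raw user activity records."""
    return [{"user": i % 100, "clicks": random.randint(0, 20)} for i in range(n)]


def store_bulk(events):
    """Stand-in for the 'Big Data' half: bulk storage (think HDFS/Hive)."""
    return list(events)  # in reality: write to a distributed store


def sample(events, k):
    """Stand-in for the 'Deep Analysis' half: sample before modeling."""
    return random.sample(events, k)


def model(events):
    """A trivial 'analysis': estimate mean clicks per event from the sample."""
    return sum(e["clicks"] for e in events) / len(events)


if __name__ == "__main__":
    raw = collect_events(100_000)
    stored = store_bulk(raw)
    estimate = model(sample(stored, 1_000))
    print(f"estimated mean clicks per event: {estimate:.2f}")
```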
Guidelines for Modeling and Optimizing NoSQL Databases - LaunchAny

eBay architect Jay Patel recently posted an article about data modeling using the Cassandra data store. In it, he breaks down how his team modeled their data using Cassandra, how they approached the use of Columns and Column Families, and how they optimized their queries. The post is very detailed and a great read. What I enjoyed most from the article was the high-level approach that Jay and his team took.

“It’s important to understand and start with entities and relationships…” Jay reminds us that we must first understand the problem domain, model the entities involved, and the relationships between them.

“…then continue modeling around query patterns by de-normalizing and duplicating.” You cannot optimize your data model until you understand how you will be accessing it.

“Remember that there are many ways to model. Always evaluate your data model based on the intended use cases.”
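To make "modeling around query patterns by de-normalizing and duplicating" concrete, here is a toy sketch in Python. Plain dicts stand in for column families, and the marketplace entities (sellers, items, categories) are illustrative assumptions, not taken from Jay's article.

```python
# A toy sketch of "model around query patterns": the same item row is
# written into two query-keyed "tables" (dicts standing in for column
# families), trading duplicated writes for single-lookup reads.
# Entities and field names are illustrative, not from the article.

items_by_seller = {}    # query: all items listed by a given seller
items_by_category = {}  # query: all items in a given category


def add_item(seller_id, category, item):
    """Write the item twice, once per query path (denormalization)."""
    items_by_seller.setdefault(seller_id, []).append(item)
    items_by_category.setdefault(category, []).append(item)


add_item("seller42", "cameras", {"id": "i1", "title": "Used DSLR", "price": 250})
add_item("seller42", "lenses", {"id": "i2", "title": "50mm prime", "price": 90})

# Each read is now a single key lookup instead of a JOIN:
print(items_by_seller["seller42"])
print(items_by_category["cameras"])
```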
JOINs via denormalization for NoSQL coders, Part 2: Materialized views - Web development blog

Thomas Wanschik on September 27, 2010

In part 1 we discussed a workaround for JOINs on non-relational databases using denormalization, in cases for which the denormalized properties of the to-one side don't change. In this post we'll show one way to handle JOINs for mutable properties of the to-one side, i.e. properties of users. Let's summarize our current situation:

- We have users (the to-one side) and their photos (the to-many side)
- Photos contain their users' gender in order to use it in queries which would otherwise need JOINs

It's obvious that a solution for the problem of mutable properties on the to-one side has to keep denormalized properties up to date, i.e. each time a user changes his/her gender (or, more likely, her hair color ;) we have to go through all of the user's photos and update the photos' denormalized gender.

Background tasks to the rescue
One way to solve the update problem is to start a background task each time a user changes his/her gender; see the sketch after this excerpt.

Materialized views
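Here is a minimal sketch of the background-task approach described above; the denormalized gender on each photo behaves like a simple materialized view that the task keeps fresh. Python's queue.Queue and a worker thread stand in for whatever task queue the platform actually provides, and the data layout is an illustrative assumption.

```python
# A minimal sketch of the background-task approach: when a user's
# gender changes, enqueue a task that rewrites the denormalized copy
# on every one of that user's photos.
import queue
import threading

users = {"u1": {"gender": "female"}}
photos = {"p1": {"user_id": "u1", "gender": "female"},
          "p2": {"user_id": "u1", "gender": "female"}}

tasks = queue.Queue()  # stand-in for a real task queue


def set_gender(user_id, gender):
    """Update the to-one side, then defer the fan-out to a background task."""
    users[user_id]["gender"] = gender
    tasks.put(user_id)


def worker():
    """Background task: bring each photo's denormalized gender up to date."""
    while True:
        user_id = tasks.get()
        for photo in photos.values():
            if photo["user_id"] == user_id:
                photo["gender"] = users[user_id]["gender"]
        tasks.task_done()


threading.Thread(target=worker, daemon=True).start()
set_gender("u1", "male")
tasks.join()
print(photos)  # both photos now carry gender="male"
```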
NoSQL Data Modeling Techniques « Highly Scalable Blog

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well studied both in practice and theory, because specific non-functional properties are often the main justification for NoSQL usage, and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques. I would like to thank Daniel Kirkdorffer who reviewed the article and cleaned up the grammar.

To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. Key-Value storage is a very simplistic but very powerful model (see the sketch below).

Conceptual Techniques
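As a toy illustration of the Key-Value model just mentioned: the store maps an opaque key to an opaque value and offers nothing richer than get/put/delete, so any structure beyond the key is the application's job. This sketch and its key-naming convention are illustrative assumptions, not taken from the article.

```python
# A minimal sketch of the Key-Value model: an opaque key maps to an
# opaque value, and the store supports only get/put/delete. All
# indexing and structure beyond the key lives in the application.


class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KVStore()
# The application encodes its own "schema" into the key:
store.put("user:1:profile", {"name": "Ada"})
store.put("user:1:photos", ["p1", "p2"])
print(store.get("user:1:profile"))
```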
NoSQL Benchmarking

NoSQL is the talk of the town, and we have already covered what it is for in one of our previous blogs. Today I would like to share the results of the NoSQL benchmark tests we recently conducted. They will help you decide whether the system you are about to develop is a good fit for NoSQL, and which NoSQL product to select. In this article we will reveal the characteristics of Cassandra, HBase, and MongoDB identified through multiple workloads.

Why NoSQL?
Interest in NoSQL continues to rise because the amount of data to process continues to increase. Why are people using NoSQL instead of an RDBMS? (Twitter, for one, is still using MySQL.) An RDBMS is known to struggle when processing terabyte- or petabyte-scale data, and there is no single correct answer for processing bulk data. Among RDBMSs, Oracle is an exception, since its performance and functions, such as mass data processing and data synchronization, are far superior to those of other RDBMSs.

Benchmarking Tests using YCSB
The test workload is as follows.
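As a general illustration of the shape of a YCSB-style test, here is a minimal driver sketch in Python: a load phase populates a record set, then a transaction phase runs a configurable read/update mix against it. The record count, field size, and ratio below are illustrative assumptions (YCSB's standard Workload A, for reference, is a 50/50 read/update mix), and an in-memory dict stands in for the database under test.

```python
# A hedged sketch of a YCSB-style workload driver: load a record set,
# then measure throughput of a mixed read/update transaction phase.
# Parameters are illustrative, not the workloads from the article.
import random
import time


def run_workload(store, num_records=10_000, num_ops=50_000, read_ratio=0.5):
    for i in range(num_records):                 # load phase
        store[f"user{i}"] = {"field0": "x" * 100}
    start = time.perf_counter()
    for _ in range(num_ops):                     # transaction phase
        key = f"user{random.randrange(num_records)}"
        if random.random() < read_ratio:
            _ = store[key]                       # read
        else:
            store[key] = {"field0": "y" * 100}   # update
    elapsed = time.perf_counter() - start
    return num_ops / elapsed                     # throughput (ops/sec)


print(f"{run_workload({}):.0f} ops/sec against an in-memory dict")
```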