
Big data


Anything to do with big data, Hadoop, MongoDB, etc.

Translate SQL to MongoDB MapReduce. I keep hearing people complain that MapReduce is not as easy as SQL. But others say SQL is not easy to grok either. I'll keep myself away from this possible flame war and just point you to this ☞ SQL to MongoDB translation PDF put together by Rick Osborne, and also his ☞ post providing some more details. As regards the SQL and MapReduce comparison, here's what Rick has to say: "It seems kind of silly to go through all this, right? SQL does all of this, but with much less complexity." I'd also like to share something I've learned lately: SQL parallel execution is supported, in different forms, by some RDBMS.
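To make the translation concrete, here is a minimal sketch (my own, not from Rick's PDF) of a SQL GROUP BY and its MongoDB map/reduce counterpart, driven from Python with pymongo. The `orders` collection, its fields, and the `order_totals` output collection are made up, and `Collection.map_reduce` assumes a pymongo 3.x-era driver (the helper was removed in pymongo 4):

```python
# SQL equivalent (hypothetical schema):
#   SELECT cust_id, SUM(amount) AS total
#   FROM orders
#   GROUP BY cust_id;
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
db = client.shop  # hypothetical database name

# The map step emits one (key, value) pair per document.
mapper = Code("""
    function () {
        emit(this.cust_id, this.amount);
    }
""")

# The reduce step folds all values emitted for one key into a single value.
reducer = Code("""
    function (key, values) {
        return Array.sum(values);
    }
""")

# Writes {_id: cust_id, value: total} documents into 'order_totals'.
result = db.orders.map_reduce(mapper, reducer, "order_totals")
for doc in result.find():
    print(doc["_id"], doc["value"])
```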

Microsoft & .NET

Comparison. Scalability. MongoDB.

Cloudera Developer Training for Apache Hadoop - London.


Hadoop

Building a Better Submission Form. If you participated in our invitation to photograph a "Moment in Time" earlier in May, you used our new photo submission software, which we call Stuffy. Built to enable users to upload media files, and to allow our producers to review uploaded files quickly, Stuffy uses a "NoSQL" storage engine to make customized forms simple. The original photo uploader, called Puffy (the Photo Upload Form For You), had hundreds of lines of code for a single custom form. Over time, that single form turned into multiple forms as we met internal demand for the tool. On the back end, the original application used a MySQL database, requiring somewhat complex SQL to generate each form and its submissions.
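As the next paragraph explains, the fix was to describe each form in data rather than code: store the whole form definition as one document and fetch it with a single lookup. A minimal sketch of that idea; the post doesn't name its storage engine at this point, so MongoDB stands in here, and the collection and field names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.stuffy  # hypothetical database name

# One document fully describes a custom form: no JOINs, no per-form code.
db.forms.insert_one({
    "_id": "moment-in-time",
    "title": "A Moment in Time",
    "fields": [
        {"name": "photo",   "type": "file",  "required": True},
        {"name": "caption", "type": "text",  "required": False},
        {"name": "email",   "type": "email", "required": True},
    ],
})

# Rendering the form is a single lookup by id.
form = db.forms.find_one({"_id": "moment-in-time"})
for field in form["fields"]:
    print(field["name"], field["type"])

# Each submission is stored as another document, tagged with the form id.
db.submissions.insert_one({
    "form_id": "moment-in-time",
    "values": {"caption": "Sunrise over the bridge", "email": "reader@example.com"},
})
```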

In the application, a form serves two purposes: to tell the application what fields to display, and to collect submissions. A Different Approach: displaying a photo submission form now requires a single lookup.

Beyond Uploading.

The Net Takeaway: SQL and Hadoop. SQL and Hadoop · 11/20/2008 12:23 PM, Database Analysis. I don't know why there is so much confusion over the role of MapReduce-oriented databases like Hadoop vs. SQL-oriented databases.

It's actually pretty simple. There are two things people want to do with databases: Select, and Aggregate/Report, aka Process. The Select portion is filtering: finding specific data points based on attributes like time, category, etc. Aggregate/Report is the most common form of data processing: once you have all those rows, you want to do something with them. So, how do we tell databases to do these two things? While some programmers immediately get what SQL can do, others find it to be "YAL", "Yet Another Language". MapReduce is a programming concept that has been around for a while in the functional programming world, but has recently become more popular as scripting languages rise and processors become more parallel. So, why the Sturm und Drang? But we aren't there yet. So, read each "SQL vs. MapReduce" debate with this distinction in mind.
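To ground the Select vs. Aggregate/Report split, here is a small self-contained Python sketch (my own illustration, not from the post): the map step does the filtering, i.e. Select, and the reduce step does the aggregation, i.e. Report. The rows and fields are made up:

```python
from functools import reduce
from collections import defaultdict

# Hypothetical rows: (category, amount)
rows = [
    ("books", 12.0), ("music", 8.5), ("books", 7.25),
    ("music", 3.0), ("games", 30.0), ("books", 4.75),
]

# -- SQL version, for comparison:
#    SELECT category, SUM(amount) FROM rows
#    WHERE amount > 5 GROUP BY category;

# Map: filter ("Select") and emit (key, value) pairs.
def map_step(row):
    category, amount = row
    if amount > 5:              # the WHERE clause
        yield (category, amount)

# Shuffle: group emitted pairs by key.
groups = defaultdict(list)
for row in rows:
    for key, value in map_step(row):
        groups[key].append(value)

# Reduce: aggregate each group ("Aggregate/Report").
totals = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(totals)   # {'books': 19.25, 'music': 8.5, 'games': 30.0}
```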

More soon…

MyNoSQL • NoSQL Databases and Polyglot Persistence: A Curated Guide.

Trinity. General-purpose graph computation faces the great challenge of random data access, and the RAM capacity of a single machine bounds the scale of single-machine solutions for general-purpose graph processing. Trinity is a general-purpose distributed graph system built over a memory cloud: a globally addressable, in-memory key-value store spanning a cluster of machines.
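Trinity itself is not open for inspection here, but "globally addressable" has a simple core: hash a key to find the machine that owns it, then read it from that machine's RAM. A toy sketch of that idea (my own, with the cluster simulated by in-process dictionaries):

```python
import hashlib

class MemoryCloud:
    """Toy globally addressable key-value store: each 'machine' is a dict."""

    def __init__(self, machines):
        # In a real cluster these would be network endpoints, not local dicts.
        self.machines = [dict() for _ in range(machines)]

    def _owner(self, key):
        # Hash the key to decide which machine owns it.
        digest = hashlib.sha1(key.encode()).digest()
        return self.machines[int.from_bytes(digest[:4], "big") % len(self.machines)]

    def put(self, key, value):
        self._owner(key)[key] = value

    def get(self, key):
        # Random access costs one hash plus one (remote) memory read.
        return self._owner(key).get(key)

cloud = MemoryCloud(machines=4)
cloud.put("vertex:42", {"out_edges": [7, 99, 3]})
print(cloud.get("vertex:42"))
```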

Through its distributed in-memory storage, Trinity provides fast random data access over a large data set. This makes Trinity a natural platform for large-scale graph processing. Among Trinity's features: it can run in both embedded (in-process) and distributed mode. Project contacts: Bin Shao, Jeff Chen, Wei-Ying Ma.

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB.

Nodechat.js – Using node.js, backbone.js, socket.io, and redis to make a real time chat app. Geek fun: take node.js and a NoSQL database (usually MongoDB, CouchDB, or Redis, though adventurous types could even try Riak, HBase, or Cassandra) and create a "real-time" chat or collaborative editor: nodechat.js is a simple, realtime chat app that leverages node.js, backbone.js, socket.IO, and redis. I wrote it as an exercise and I am sharing it because there are relatively few working examples using all these pieces together. The outcome? You get familiar with the basics of these cool technologies.
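nodechat.js wires these pieces together in JavaScript; the redis part of such a chat app is plain pub/sub. The sketch below shows that one piece with the redis-py client instead of node.js (channel name and message format are made up):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_message(user, text):
    # Every server process subscribed to the channel sees this message.
    r.publish("chat:lobby", json.dumps({"user": user, "text": text}))

def listen():
    pubsub = r.pubsub()
    pubsub.subscribe("chat:lobby")
    # In nodechat.js this loop is event-driven; here it blocks for simplicity.
    for message in pubsub.listen():
        if message["type"] == "message":   # skip the subscribe confirmation
            payload = json.loads(message["data"])
            print(f'{payload["user"]}: {payload["text"]}')

# publish_message("alice", "hello, world")  # run listen() in another process
```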

Update: A node.js, socket.io, and CouchDB post. Original title and link: nodechat.js – Using node.js, backbone.js, socket.io, and redis to make a real time chat app (NoSQL databases © myNoSQL)

NoSQL Databases: What, Why, and When.

How Digg is Built? Using a Bunch of NoSQL technologies. The picture should speak for Digg's polyglot persistence approach, but here is also a description of the data stores in use. Digg stores data in multiple types of systems depending on the type of data and the access patterns (and also, in some cases, for historical reasons):

Cassandra: The primary store for "object-like" access patterns for such things as Items (stories), Users, Diggs, and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here (a sketch of this pattern follows the list). […]

HDFS: Logs from site and API events, and user activity. Data source and destination for batch jobs run with MapReduce and Hive in Hadoop. Big Data and Big Compute!

MySQL: This is mainly the current store for the story promotion algorithm and calculations, because these require lots of JOIN-heavy operations, which are not a natural fit for the other data stores at this time.

I know this will sound strange, but isn't it too much in there? @antirez.
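The Cassandra 0.6 remark means every secondary index there is application-maintained: each write must update both the primary row and an index row keyed by the attribute. A generic sketch of the pattern over a toy key-value store (names are illustrative, not Digg's schema):

```python
# Toy key-value store standing in for Cassandra 0.6 (no secondary indexes).
store = {}

def put_item(item_id, item):
    # Write the primary row...
    store[f"item:{item_id}"] = item
    # ...and maintain each index row ourselves, in application logic.
    index_key = f"index:by_user:{item['user']}"
    store.setdefault(index_key, set()).add(item_id)

def items_by_user(user):
    # A "secondary index" lookup is one read of the index row,
    # then point reads of each primary row it references.
    ids = store.get(f"index:by_user:{user}", set())
    return [store[f"item:{i}"] for i in ids]

put_item("story-1", {"user": "kevin", "title": "Digg architecture"})
put_item("story-2", {"user": "kevin", "title": "Cassandra at Digg"})
print([it["title"] for it in items_by_user("kevin")])
```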

Performance: Scaling Strategies for ASP.NET Applications. By Richard Campbell and Kent Alstad. As ASP.NET performance advisors, we are typically brought into a project when it's already in trouble. In many cases, the call doesn't come until after the application has been put into production.

What worked great for the developers isn't working well for users. The complaint: the site is too slow. Management wants to know why this wasn't discovered in testing. Some of the busiest Web sites in the world run on ASP.NET.

The Performance Equation. In September 2006, Peter Sevcik and Rebecca Wetzel of NetForecast published a paper called "Field Guide to Application Delivery Systems."

[Figure 1: The Original Performance Equation]
[Figure 2: The Web Version of the Performance Equation]
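The two figures carry the formulas themselves; as I recall the Sevcik and Wetzel paper, the original equation has roughly this shape (reconstructed from memory, so treat the exact form as an assumption):

$$ R = \frac{\text{Payload}}{\text{Bandwidth}} + \text{AppTurns} \times RTT + C_s + C_c $$

where $R$ is total response time, the first term is transfer time, the second counts application round trips, and $C_s$ and $C_c$ are server and client compute time.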

Now that you have the formula, the challenge lies in measuring each element. That leaves Cs and Cc, which need some additional development effort.

Load Balancing.

NoRM.

Downloads - Cloudera Support.

Cloudera's Hadoop Demo VM - Cloudera Support. To make it easy for you to get started with Apache Hadoop, Cloudera provides a set of virtual machines with everything you need; the VMs run CentOS 5 and cover several CDH releases (CDH3u1, CDH3u3, CDH3u4, CDH4.0.1, CDH4.1.1).

An example of using F# and C# (.net/mono) with Amazon's Elastic Mapreduce (Hadoop). Feb 07. This posting gives an example of how F# and C# can scale, potentially to thousands of machines, with MapReduce in order to efficiently process terabyte (TB) and petabyte (PB) amounts of data. It shows a C# mapper function and an F# reducer function, with a description of how to deploy the job on Amazon's Elastic MapReduce using a bootstrap action (it was tested with an Elastic MapReduce cluster of 10 machines).

The .net environment used is mono 2.8 and FSharp 2.0. The code described in this posting can be found on… Contents: MapReduce code, the C# mapper, compiling the C# code, the F# reducer, compiling the F# code, and deployment on Amazon's Elastic MapReduce. In order to run mono/.net code on Elastic MapReduce (Debian) Linux nodes you need to install mono on each node; this can be done with a bootstrap action shell script. Also included: the bootstrap action shell script for installing mono, and a Python script to deploy the MapReduce job and check its status until it is done.
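The post's mapper and reducer are C# and F#, but the Hadoop streaming contract they implement is language-agnostic: read lines on stdin, write tab-separated key/value pairs on stdout. Here is that same shape sketched as a Python word-count pair, a stand-in rather than the post's actual code:

```python
import sys

def mapper():
    # Map: one input line in, zero or more "key\tvalue" lines out.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce: Hadoop streaming delivers lines sorted by key, so equal
    # keys are adjacent and can be summed with a running counter.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Run as: script.py map   (or)   script.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```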

Distributed Systems - Google Code University - Google Code.