
BigData


What’s the difference between big data and business analytics?

I offend people daily.

People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”. No. There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics. No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get fired or laid off, or, even more likely, are never hired in the first place, and the computer is programmed to update itself via an optimization method.

MapReduce Algorithms - Understanding Data Joins Part 1

In this post we continue with our series implementing the algorithms found in the Data-Intensive Text Processing with MapReduce book, this time discussing data joins.
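As a preview of the technique, here is a toy sketch (plain Python standing in for Hadoop code) of a reduce-side join: the map step tags each record with its join key and its source, the shuffle groups records by key, and the reduce step pairs up the groups. The dataset and field names are made up for illustration.

```python
from collections import defaultdict

# Two toy "datasets" to be joined on customer id (hypothetical example data).
customers = [(1, "Alice"), (2, "Bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

# Map phase: tag every record with its join key and which dataset it came from.
mapped = [(cid, ("C", name)) for cid, name in customers]
mapped += [(cid, ("O", item)) for cid, item in orders]

# Shuffle phase: group all tagged records by key (Hadoop does this for us).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: within each key's group, pair customer records with order records.
joined = []
for key, values in groups.items():
    names = [v for tag, v in values if tag == "C"]
    items = [v for tag, v in values if tag == "O"]
    for name in names:
        for item in items:
            joined.append((key, name, item))

print(sorted(joined))
# → [(1, 'Alice', 'book'), (1, 'Alice', 'pen'), (2, 'Bob', 'lamp')]
```

In Hive or Pig this whole job collapses to a single JOIN statement, which is exactly the article’s point about preferring higher-level tools.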

While we are going to discuss the techniques for joining data in Hadoop and provide sample code, in most cases you probably won’t be writing code to perform joins yourself. Instead, joining data is better accomplished using tools that work at a higher level of abstraction, such as Hive or Pig. So why take the time to learn how to join data if there are tools that can take care of it for you?

Running Map-Reduce Job in Apache Hadoop (Multinode Cluster)

We will describe here the process of running a MapReduce job on an Apache Hadoop multinode cluster.
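The demo this walkthrough runs is the classic WordCount job. As a minimal sketch (plain Python standing in for the Java mapper and reducer), its two phases look like this; the sample lines are made up for illustration:

```python
from collections import defaultdict

# Toy input standing in for the Gutenberg text files.
lines = ["the outline of science", "the science of life"]

# Map phase: emit a (word, 1) pair for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: sum the counts for each word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
# → {'the': 2, 'outline': 1, 'of': 2, 'science': 2, 'life': 1}
```

On a real cluster, each mapper runs over one block of the input and the framework performs the shuffle, so this same logic scales to the full e-texts.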

To set up Apache Hadoop in a multinode cluster, one can read Setting up Apache Hadoop Multi-Node Cluster. For setup we have to configure Hadoop on each machine: add the following property in conf/mapred-site.xml on all the nodes. N.B. The last three are additional settings, so we can omit them.

The Gutenberg Project: For our MapReduce demo we will be using the WordCount example job, which reads text files and counts how often words occur. Download the example inputs from the following sites; all e-texts should be in plain-text us-ascii encoding. The Outline of Science, Vol. 1 (of 4) by J. Please google for those texts.

REINVENTING SOCIETY IN THE WAKE OF BIG DATA

What those breadcrumbs tell is the story of your life.

It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you.

They can do this because the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd.

What is Big Data - Theory to Implementation

Hadoop Deep Dive: HDFS and MapReduce

Following my initial introduction to Hadoop and overview of Hadoop components, I studied the Yahoo Hadoop tutorial, and have a deeper understanding of Hadoop.
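As context for the deep dive, here is a toy sketch of the idea that sets HDFS apart from an ordinary file server: files are split into fixed-size blocks, and each block is replicated on several datanodes, so losing one machine loses no data. The block size, node names, and replication factor below are made-up small values; real HDFS uses large blocks (e.g. 128 MB) and a default replication factor of 3.

```python
import itertools

BLOCK_SIZE = 4      # bytes per block (toy value; HDFS blocks are e.g. 128 MB)
REPLICATION = 2     # copies per block (HDFS defaults to 3)
datanodes = ["node1", "node2", "node3"]  # hypothetical cluster

data = b"hello hadoop"

# Split the "file" into fixed-size blocks, as the namenode's metadata records.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Place each block on REPLICATION distinct datanodes, round-robin style.
placement = {}
ring = itertools.cycle(datanodes)
for idx, _ in enumerate(blocks):
    placement[idx] = [next(ring) for _ in range(REPLICATION)]

# Even if one node dies, every block still has a surviving replica.
dead = "node2"
survivors = all(any(n != dead for n in nodes) for nodes in placement.values())
print(placement, survivors)
```

A conventional file server, by contrast, keeps one copy of the file in one place, which is why the article concludes that HDFS is preferred when availability and capacity have to scale.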

I would like to share my learning and help others understand Hadoop. Why does Hadoop require HDFS; what’s wrong with NFS? Nothing! NFS has been around for years and is incredibly stable. A distributed file system such as NFS provides the functionality required for servers to share files. But for Hadoop applications that require a highly available distributed file system with effectively unlimited capacity, HDFS is preferred.

Hadoop Virtual Panel

Today, Big Data and Hadoop are taking the computer industry by storm.

Its usage is on the mind of everyone, from CEOs to CIOs to developers. According to Wikipedia: “Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.[1] It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers. The entire Apache Hadoop ‘platform’ is now commonly considered to consist of the Hadoop kernel, MapReduce and HDFS, as well as a number of related projects, including Apache Hive, Apache HBase, and others.”

Unfortunately, this definition does not really explain either what Hadoop is or what its role in the enterprise is.

Ten Common Hadoopable Problems: Real-World Hadoop Use Cases

Apache Hadoop, the popular data storage and analysis platform, has generated a great deal of interest recently.

Large and successful companies are using it to do powerful analyses of the data they collect.

Welcome to Apache™ Hadoop®!