Data Mining

Algorithmic Thoughts - Artificial Intelligence. In an earlier post, I had described how DBSCAN is way more efficient(in terms of time) at clustering than K-Means clustering.

It turns out that there is a modified K-Means algorithm which is far more efficient than the original algorithm. The algorithm is called Mini Batch K-Means clustering. It is mostly useful in web applications where the amount of data can be huge, and the time available for clustering maybe limited. The Problem Suppose we have a dataset of 500000 records, and we want to divide them into 100 clusters. The Idea The idea of the algorithm is to represent the dataset by a smaller subset of the data.

The Algorithm The algorithm takes small batches(randomly chosen) of the dataset for each iteration. The graph below shows a comparison of Stochastic Gradient Descent(SGD) K-Means, Mini Batch K-Means and Batch K-Means. It is important to notice that this graph is for a particular value of K. I have written the Mini Batch K-Means algorithm in C. Like this: Running Hadoop On Ubuntu Linux (Single-Node Cluster) In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

Running Hadoop On Ubuntu Linux (Single-Node Cluster)

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

This tutorial has been tested with the following software versions: Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04) Hadoop 1.0.3, released May 2012 Sun Java 6 Disabling IPv6. Apache Mahout: Scalable machine learning and data mining. Pier Luca Lanzi. Statistical Data Mining Tutorials. Advertisment: In 2006 I joined Google.

We are growing a Google Pittsburgh office on CMU's campus. We are hiring creative computer scientists who love programming, and Machine Learning is one the focus areas of the office. We're also currently accepting resumes for Fall 2008 intenships. If you might be interested, feel welcome to send me email: awm@google.com . The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms. These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and cased-based (aka non-parametric) learning. I hope they're useful (and please let me know if they are, or if you have suggestions or error-corrections).