Data science (aka Data mining)
The book has now been published by Cambridge University Press.
Google Data Science tools
Data pre-processing and cleansing
oluolu - Project Hosting on Google Code. Oluolu is an open source query log mining tool that runs on Hadoop.
Pattern is a web mining module for the Python programming language. It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, k-NN, SVM), and data visualization (graph networks).
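To make the "tf-idf + cosine similarity" metrics mentioned above concrete, here is a minimal sketch in plain Python. It does not use Pattern's own API; the function names (tf_idf_vectors, cosine_similarity), the whitespace tokenizer, and the sample documents are all assumptions made for illustration, and the weighting is the simplest tf * log(N/df) variant rather than whatever Pattern implements.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Illustrative helper (not part of Pattern): tokens are lowercased,
    whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for this sketch.
docs = [
    "query log mining on hadoop",
    "web mining with python",
    "query log analysis on hadoop",
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The first and third documents share four terms, so their similarity comes out higher than that of the first and second, which share only "mining". A term that occurs in every document gets weight zero under this scheme, which is why common words contribute nothing to the score.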
Special Online Collection: Dealing with Data