
Data science (aka Data mining)


Data mining, forecasting and bioinformatics competitions on Kaggle. All entries. Mining of Massive Datasets. The book has a new Web site

Mining of Massive Datasets

This page will no longer be maintained. Your browser should be automatically redirected to the new site in 10 seconds.

The book has now been published by Cambridge University Press. The publisher is offering a 20% discount to anyone who buys the hardcopy here. By agreement with the publisher, you can still download it free from this page.

--- Jure Leskovec, Anand Rajaraman (@anand_raj), and Jeff Ullman

Download Version 2.1

The following is the second edition of the book, which we expect to be published soon.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Version 2.1 adds Section 10.5 on finding overlapping communities in social graphs.

Download the Latest Book (511 pages, approximately 3MB)

Download chapters of the book:

Download Version 1.0

Download the Book as Published (340 pages, approximately 2MB)

Gradiance Support.

Haystack Group. Home - CKAN. Making Data Social.


Toolboxes. Datasets. Google Data Science tools. Data pre-processing and cleansing. Oluolu - Project Hosting on Google Code. Oluolu is an open-source query log mining tool that runs on Hadoop.

oluolu - Project Hosting on Google Code

This tool provides resources for adding new features to search engines. Concretely, Oluolu supports automatic creation of dictionaries, such as spelling correction, context queries, or frequent query n-grams, from query log data. The dictionaries are applied to search engines to add features such as a 'did you mean' or 'related keyword suggestion' service.

2011-11-16 oluolu 0.2.1 released: Issue 5 (conf directory is missing), Issue 7 (no output)
2011-05-11 oluolu 0.2.0 released: added new parameter -inputLanguage
2010-10-12 oluolu 0.1.4rc2 released
2010-06-09 oluolu 0.1.2 released: added a new parameter, '-showScore', to output the confidence scores for the elements in the related query dictionary
2010-04-26 oluolu 0.1.1 released: fixed a bug (setting for the number of reducers is not activated)
2010-02-08 oluolu 0.1 released: Spelling correction dictionary, Context dictionary
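As a rough illustration of what a "frequent query n-grams" dictionary might contain, here is a minimal single-machine sketch in Python. It is not Oluolu's implementation (Oluolu runs on Hadoop and its internals are not described here); the function names, the whitespace tokenization, the `min_count` threshold, and the sample log are all assumptions made for the example.

```python
from collections import Counter


def query_ngrams(query, n):
    """Return the word n-grams of one query (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, which is an
    assumption for this sketch.
    """
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(queries, n=2, min_count=2):
    """Count n-grams over all queries and keep those seen at least min_count times."""
    counts = Counter()
    for query in queries:
        counts.update(query_ngrams(query, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Made-up query log, one query per entry (illustrative data only).
sample_log = [
    "cheap flights to tokyo",
    "flights to tokyo",
    "hotels in tokyo",
    "cheap flights to paris",
]

frequent = frequent_ngrams(sample_log, n=2, min_count=2)
for gram, count in sorted(frequent.items(), key=lambda item: -item[1]):
    print(gram, count)
```

On a real query log the counting step would be distributed (for example as a map step emitting n-grams and a reduce step summing them), but the resulting dictionary has the same shape: n-gram mapped to frequency.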

Pattern. Pattern is a web mining module for the Python programming language.


It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-documented and bundled with 50+ examples and 350+ unit tests.

Download

Installation

Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, from the command line do:

> cd pattern-2.6
> python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

If none of the above works, you can make Python aware of the module in three ways:

Quick overview

pattern.web

pattern.en

The pattern.en module is a natural language processing (NLP) toolkit for English.

Special Online Collection: Dealing with Data.