Data science (aka Data mining)

Data mining, forecasting and bioinformatics competitions on Kaggle. All entries. Mining of Massive Datasets. The book has now been published by Cambridge University Press.

Mining of Massive Datasets

The publisher is offering a 20% discount to anyone who buys the hard copy here. By agreement with the publisher, you can still download it free from this page. Cambridge Press does, however, retain copyright on the work, and we expect that you will obtain their permission and acknowledge our authorship if you republish parts or all of it. We are sorry to have to mention this point, but we have evidence that other items we have published on the Web have been appropriated and republished under other names. It is easy to detect such misuse, by the way, as you will learn in Chapter 3.

--- Jure Leskovec, Anand Rajaraman (@anand_raj), and Jeff Ullman

Download Version 2.1

The following is the second edition of the book, which we expect to be published soon.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper.

www.cs.dartmouth.edu/~ac/Teach/CS85-Fall09/Notes/lecnotes.pdf. Haystack Group. Home - CKAN. Making Data Social.
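For readers unfamiliar with the map-reduce style that the revised Chapter 2 covers, here is a minimal single-machine sketch in plain Python. It is not taken from the book; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework such as Hadoop would distribute these steps across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["a b a", "B c"])` counts words case-insensitively across both documents.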

Visualization

Toolboxes. Datasets. Google Data Science tools. Data pre-processing and cleansing. Oluolu - Project Hosting on Google Code. Oluolu is an open-source query log mining tool that runs on Hadoop.

oluolu - Project Hosting on Google Code

This tool provides resources for adding new features to search engines. Concretely, Oluolu supports automatic creation of dictionaries, such as spelling-correction, context-query, and frequent query n-gram dictionaries, from query log data. The dictionaries are applied to search engines to add features such as a 'did you mean' or 'related keyword suggestion' service.

2011-11-16 oluolu 0.2.1 released: Issue 5 (conf directory is missing), Issue 7 (no output)
2011-05-11 oluolu 0.2.0 released: added new parameter -inputLanguage
2010-10-12 oluolu 0.1.4rc2 released
2010-06-09 oluolu 0.1.2 released: added a new parameter, '-showScore', to output the confidence scores for the elements in the related query dictionary
2010-04-26 oluolu 0.1.1 released: fixed a bug (setting for the number of reducers is not activated)
2010-02-08 oluolu 0.1 released: Spelling correction dictionary, Context dictionary
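To make the idea of a frequent query n-gram dictionary concrete, the following is a small in-memory sketch in Python. It is not Oluolu's implementation (Oluolu runs as Hadoop jobs); the function names, the whitespace tokenization, and the `min_count` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def query_ngrams(query, n):
    """Return the word n-grams of one query as space-joined strings."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(queries, n=2, min_count=2):
    """Count n-grams over a query log and keep those seen at least min_count times."""
    counts = Counter()
    for query in queries:
        counts.update(query_ngrams(query, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Hypothetical query log, one query per entry
log = ["new york hotels", "cheap new york flights", "york hotels"]
frequent = frequent_ngrams(log, n=2, min_count=2)
```

With this toy log, only the bigrams that occur in at least two queries survive the threshold; a dictionary like this could then feed a keyword suggestion feature.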

Pattern. Pattern is a web mining module for the Python programming language.

Pattern

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-documented and bundled with 50+ examples and 350+ unit tests.

Installation

Pattern is written for Python 2.5+ (no support for Python 3 yet).

To install Pattern so that the module is available in all Python scripts, from the command line do:

> cd pattern-2.6
> python setup.py install

If you have pip, you can automatically download and install from the PyPI repository: If none of the above works, you can make Python aware of the module in three ways:

Quick overview

pattern.web

pattern.en

The pattern.en module is a natural language processing (NLP) toolkit for English.
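The installation notes above mention making Python aware of the module when a normal install does not work. One general way to do that for any module is to add its directory to `sys.path` before importing it; the sketch below shows this under the assumption that the download was unpacked to a hypothetical `downloads/pattern-2.6` folder. The helper name `add_module_dir` is made up for this example.

```python
import os
import sys


def add_module_dir(path):
    """Append a directory to sys.path (once) so modules inside it can be imported."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.append(path)
    return path


# Hypothetical location of the unpacked download; adjust to your own layout.
MODULE_DIR = add_module_dir(os.path.join("downloads", "pattern-2.6"))

# From here on, an `import pattern` statement would also search MODULE_DIR.
```

Appending rather than prepending keeps the standard library and already-installed packages ahead of the added directory in the search order.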

Special Online Collection: Dealing with Data.