machine learning in Python — scikit-learn v0.9 documentation "We use scikit-learn to support leading-edge basic research [...]" "I think it's the most well-designed ML package I've seen so far." "scikit-learn's ease-of-use, performance and overall variety of algorithms implemented has proved invaluable [...]." "For these tasks, we relied on the excellent scikit-learn package for Python." "The great benefit of scikit-learn is its fast learning curve [...]" "It allows us to do AWesome stuff we would not otherwise accomplish" "scikit-learn makes doing advanced analysis in Python accessible to anyone."
Mining of Massive Datasets The book has a new Web site www.mmds.org. This page will no longer be maintained. Your browser should be automatically redirected to the new site in 10 seconds. --- Jure Leskovec, Anand Rajaraman (@anand_raj), and Jeff Ullman Download Version 2.1 The following is the second edition of the book, which we expect to be published soon. There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Version 2.1 adds Section 10.5 on finding overlapping communities in social graphs. Download the Latest Book (511 pages, approximately 3MB) Download chapters of the book: Download Version 1.0 The following materials are equivalent to the published book, with errata corrected to July 4, 2012. Download the Book as Published (340 pages, approximately 2MB) Gradiance Support Students who want to use the Gradiance system for self-study can register at www.gradiance.com/services. Other Stuff
Sampling Distribution of Difference Between Means Sampling Distribution of Difference Between Means Author(s) David M. Lane Prerequisites Sampling Distributions, Sampling Distribution of the Mean, Variance Sum Law I Learning Objectives State the mean and variance of the sampling distribution of the difference between means Compute the standard error of the difference between means Compute the probability of a difference between means being above a specified value The sampling distribution of the difference between means can be thought of as the distribution that would result if we repeated the following three steps over and over again: (1) sample n1 scores from Population 1 and n2 scores from Population 2, (2) compute the means of the two samples (M1 and M2), and (3) compute the difference between means, M1 - M2. As you might expect, the mean of the sampling distribution of the difference between means is: which says that the mean of the distribution of differences between sample means is equal to the difference between population means.
Pattern Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-document and bundled with 50+ examples and 350+ unit tests. Download Installation Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, from the command line do: > cd pattern-2.6 > python setup.py install If you have pip, you can automatically download and install from the PyPi repository: If none of the above works, you can make Python aware of the module in three ways: Quick overview pattern.web pattern.en The pattern.en module is a natural language processing (NLP) toolkit for English. pattern.search pattern.vector Case studies
SourceForge.net: boost your Machine Learning projects - Project Web Hosting - Open Source Software PyBrain Videos This video presentation was shown at the ICML Workshop for Open Source ML Software on June 25, 2010. It explains some of the features and algorithms of PyBrain and gives tutorials on how to install and use PyBrain for different tasks. This video shows some of the learning features in PyBrain in action. Algorithms We implemented many useful standard and advanced algorithms in PyBrain, and in some cases created interfaces to existing libraries (e.g. Supervised Learning Back-PropagationR-PropSupport-Vector-Machines (LIBSVM interface) Evolino Unsupervised Learning K-Means ClusteringPCA/pPCALSH for Hamming and Euclidean SpacesDeep Belief Networks Reinforcement Learning Value-based Q-Learning (with/without eligibility traces)SARSANeural Fitted Q-iteration Policy Gradients REINFORCENatural Actor-Critic Exploration Methods Epsilon-Greedy Exploration (discrete)Boltzmann Exploration (discrete)Gaussian Exploration (continuous)State-Dependent Exploration (continuous) Black-box Optimization Networks Tools
oluolu - Project Hosting on Google Code Oluolu is a open source query log mining tool which works on Hadoop. This tool provides resources to add new features to search engines. Concretely Oluolu supports automatic dictionary creation such as spelling correction, context queries or frequent query n-grams from query log data. The dictionaries are applied to search engines to add features such as 'did you mean' or 'related keyword suggestion' service in search engines. 2011-11-16 oluolu 0.2.1 released Issue 5 (conf directory is missing) Issue 7 (no output) 2011-05-11 oluolu 0.2.0 released added new parameter -inputLanguage. 2010-10-12 oluolu 0.1.4rc2 released 2010-06-09 oluolu 0.1.2 released added a new parameter, '-showScore' to output the confidence socres for the elements in related query dictionary 2010-04-26 oluolu 0.1.1 released fixed a bug (setting for the number of reducers is not activated) 2010-02-08 oluolu 0.1 released Spelling correction dictionary Context dictionary
Generalized linear model In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Intuition Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). However, these assumptions are inappropriate for many types of response variables. Similarly, a model that predicts a probability of making a yes/no choice (a Bernoulli variable) is even less suitable as a linear-response model, since probabilities are bounded on both ends (they must be between 0 and 1). Overview Model components 1. as
Home | The IATI Standard Alpha version Please note that the Datastore is currently in its first release. Therefore, data queries may sometimes result in unexpected results. We appreciate your understanding. What is the IATI Datastore? The IATI Datastore is an online service that gathers all data published to the IATI standard into a single queryable source. How does it work? Data that is recorded on the IATI Registry, and is valid against the standard, is pulled into the Datastore on a nightly basis. Who is it for? The store is a service for analysts, data journalists, infomediaries and developers. Why a store? This repository is called a store, not a database, because it cannot be used as a single dataset. How to access the Datastore¶ An API is available that enables people to construct queries.For those wishing to just access the data in CSV format, an online form is available to assist with queries Are there any limitations on the Datastore?