
machinelearning
Projects matching python.
About: ELKI is a framework for implementing data-mining algorithms with support for index structures, that includes a wide variety of clustering and outlier detection methods.
Changes: The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0:
- further speed improvements
- R-Tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-Tree variations, including the original Guttman R-Tree, not only the R*-Tree.
- K-Means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including K-Means++
- VA-File (static only, not dynamic databases); partial-VA to come for 0.5.0 final?
- Many popular cluster evaluation measures
- Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!)
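The K-Means item above mentions Lloyd-style iterations and K-Means++ seeding. The following is a minimal pure-Python sketch of those two ideas, not ELKI's implementation (ELKI is a Java framework). The function names `sq_dist`, `kmeans_pp_seeds` and `lloyd` are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers by K-Means++ (squared-distance weighted) sampling.

    Assumes at least k distinct points; otherwise all weights can be zero.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def lloyd(points, centers, max_iter=100):
    """Lloyd-style batch iterations: assign every point, then recompute every mean."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    seeds = kmeans_pp_seeds(pts, 2, random.Random(0))
    final_centers, final_clusters = lloyd(pts, seeds)
    print(sorted(final_centers))
```

A MacQueen-style variant would instead update the affected mean immediately after each single point is reassigned, rather than once per full pass.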
All entries
The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM). It provides a generic SVM object interfacing to several different SVM implementations, among them the state-of-the-art OCAS, Liblinear, LibSVM, SVMLight, SVMLin and GPDT. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, such as the Linear, Polynomial, Gaussian and Sigmoid kernels, but also comes with a number of recent string kernels, e.g. the Locality Improved, Fisher, TOP, Spectrum and Weighted Degree (with shifts) kernels.
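To make the kernel names above concrete, here is a small pure-Python sketch of three of them: the Polynomial, Gaussian and Spectrum kernels. It is an illustration only and does not use or reproduce shogun's API; the function names and the parameter names `degree`, `coef0`, `gamma` and `k` are assumptions, and the Gaussian kernel is written in one common parameterization, exp(-gamma * ||x - y||^2).

```python
import math
from collections import Counter


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def spectrum_kernel(s, t, k=3):
    """Spectrum kernel: inner product of the k-mer count vectors of two strings."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Only k-mers that occur in both strings contribute to the sum.
    return sum(n * counts_t[kmer] for kmer, n in counts_s.items())


if __name__ == "__main__":
    print(polynomial_kernel((1, 2), (3, 4)))
    print(gaussian_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5))
    print(spectrum_kernel("GATTACA", "ATTAC", k=3))
```

Because a string kernel only has to return a similarity value for a pair of inputs, an SVM can use it on sequences without first mapping them to fixed-length numeric vectors.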
shogun | A Large Scale Machine Learning Toolbox
Orange - Data Mining Fruitful & Fun
Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning.
machine learning in Python — scikit-learn v0.9 documentation
PyBrain
projects:lasvm [Léon Bottou]
Multiclass Support Vector Machine | GPU Computing
Abstract — We propose a new algorithm for the incremental training of Support Vector Machines (SVMs) that is suitable for problems of sequentially arriving data and fast constraint parameter variation. Our method involves using a “warm-start” algorithm for the training of Support Vector Machines (SVMs), which allows us to take advantage of the natural incremental properties of the standard active set approach to linearly constrained optimisation problems. Incremental training involves quickly re-training a support vector machine after adding a small number of additional training vectors to the training set of an existing (trained) support vector machine. Similarly, the problem of fast constraint parameter variation involves quickly re-training an existing support vector machine using the same training set but different constraint parameters.
CiteSeerX — Incremental training of support vector machines
Abstract
The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain dependent approaches.
This paper presents a novel approach for detecting duplicate records in the context of digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative machine learning approaches designed for the task of classifying pairs of gazetteer records as either duplicates or not, built by using support vector machines or alternating decision trees with different combinations of similarity scores for the feature vectors. Experimental results show that using feature vectors that combine multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, leads to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections, through the usage of filtering or blocking techniques. This work was partially supported by the Fundação para a Ciência e a Tecnologia (FCT), through project grant PTDC/EIA-EIA/109840/2009 (SInteliGIS).
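Both abstracts build on scoring a pair of records by combining per-attribute similarities. The sketch below shows only the simplest baseline named in the first abstract, a weighted aggregation compared against a threshold; it is not the method of either paper. The field names, weights and threshold are made-up illustrative values, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string similarity a real system would use.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Weighted average of per-field string similarities.

    weights maps a field name to its (hypothetical) importance.
    """
    total = sum(weights.values())
    score = sum(w * string_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return score / total


def is_duplicate(rec_a, rec_b, weights, threshold=0.8):
    """Flag the pair as a duplicate when the aggregate similarity reaches the threshold."""
    return record_similarity(rec_a, rec_b, weights) >= threshold


if __name__ == "__main__":
    weights = {"name": 0.7, "type": 0.3}  # illustrative values, not tuned
    a = {"name": "Lisbon", "type": "city"}
    b = {"name": "Lisboa", "type": "city"}
    c = {"name": "Tokyo", "type": "river"}
    print(record_similarity(a, b, weights), is_duplicate(a, b, weights))
    print(record_similarity(a, c, weights), is_duplicate(a, c, weights))
```

The supervised approach in the second abstract would, roughly, feed the individual similarity scores to a classifier as a feature vector instead of fixing the weights and threshold by hand.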
Abstract
MILK: MACHINE LEARNING TOOLKIT — milk 0.3.7 documentation
Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
em is a package for creating Gaussian Mixture Models (diagonal and full covariance matrices supported), sampling from them, and estimating them from data using the Expectation Maximization algorithm. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess the number of clusters in the data. In the near future, I hope to add so-called online EM (i.e. recursive EM) and a variational Bayes implementation. em is implemented in Python, and uses the excellent numpy and scipy packages.
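As a rough illustration of what estimating a Gaussian mixture with Expectation Maximization and then scoring it with the Bayesian Information Criterion involves, here is a one-dimensional pure-Python sketch. It does not use the em package or its API; the function names, the fixed iteration count and the variance floor `min_var` are assumptions made for the example. BIC is written here as n_params * ln(n) - 2 * log-likelihood, so lower values are better.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor guards against collapse
    return weights, means, variances


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def bic(data, weights, means, variances):
    """BIC with k means, k variances and k - 1 free mixture weights."""
    n_params = 3 * len(means) - 1
    return n_params * math.log(len(data)) - 2 * log_likelihood(data, weights, means, variances)


if __name__ == "__main__":
    data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
    for init in ([2.5], [min(data), max(data)]):
        params = fit_gmm_1d(data, init)
        print(len(init), "component(s): BIC =", bic(data, *params))
```

Comparing the BIC of fits with different numbers of components is one way to choose the number of clusters; on the toy data above the two-component fit scores lower.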
Em
Python module for extended Infomax ICA
Independent Component Analysis (ICA) is a modern and effective method for performing blind source separation (also known as the Cocktail Party Problem). Fields of application are artifact reduction in multivariate data (e.g. EEG or MEG), finding hidden factors in financial data or noise reduction in images. Further, ICA can be used to simplify and improve the solution of the inverse source problem in EEG and MEG analysis. As I found no Python module for performing ICA, I wrapped the existing extended infomax implementation from EEGLAB.
For some large data sets, performance is similar with or without nonlinear mappings. Without using kernels, one can quickly train on a much larger set via a linear classifier. Document classification is one such application. In the following example (20,242 instances and 47,236 features; available on LIBSVM data sets), the cross-validation time is significantly reduced by using LIBLINEAR:
    % time libsvm-2.85/svm-train -c 4 -t 0 -e 0.1 -m 800 -v 5 rcv1_train.binary
    Cross Validation Accuracy = 96.8136%
    345.569s
    % time liblinear-1.21/train -c 4 -e 0.1 -v 5 rcv1_train.binary
    Cross Validation Accuracy = 97.0161%
    2.944s
Warning: While LIBLINEAR's default solver is very fast for document classification, it may be slow in other situations.
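To show why a linear classifier can be cheap to train on sparse document features, here is a toy pure-Python linear SVM trained by stochastic subgradient descent on the hinge loss. This is explicitly not LIBLINEAR's solver and says nothing about its timings; the function names, the regularization value `lam`, the step-size schedule and the tiny data set are all assumptions made for illustration. Each update touches only the nonzero features of one example plus a rescaling of the weights seen so far.

```python
import random


def score(w, x):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())


def train_linear_svm(examples, n_epochs=50, lam=0.01, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss.

    examples: list of (features, label) pairs, where features is an
    {index: value} dict and label is +1 or -1.
    """
    rng = random.Random(seed)
    w = {}
    t = 0
    for _ in range(n_epochs):
        order = list(examples)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y * score(w, x)
            # Shrink existing weights (gradient step on the regularizer).
            for i in w:
                w[i] *= 1.0 - eta * lam
            # Move toward the example only if it violates the margin.
            if margin < 1.0:
                for i, v in x.items():
                    w[i] = w.get(i, 0.0) + eta * y * v
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if score(w, x) >= 0 else -1


if __name__ == "__main__":
    # Toy "documents": features 0 and 1 occur in one class, 2 and 3 in the other.
    train = [
        ({0: 1.0}, 1),
        ({0: 1.0, 1: 1.0}, 1),
        ({2: 1.0}, -1),
        ({2: 1.0, 3: 1.0}, -1),
    ]
    w = train_linear_svm(train)
    print([predict(w, x) for x, _ in train])
```

A kernel SVM, by contrast, has to evaluate the kernel between training points, which is part of why the linear route can scale to much larger sets when both give similar accuracy.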

