machinelearning


Projects matching python.

http://mloss.org/software/search/?searchterm=python
About: Python module to ease pattern classification analyses of large datasets. It provides high-level abstraction of typical processing steps (e.g. data preparation, classification, feature selection, [...]
Changes: This release aggregates all the changes that occurred between the official releases in the 0.4 series and various snapshot releases (in the 0.5 and 0.6 series).
About: ELKI is a framework for implementing data-mining algorithms with support for index structures. It includes a wide variety of clustering and outlier detection methods.
Changes: The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0:
- further speed improvements
- R-Tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-Tree variations, including the original Guttman R-Tree, not only the R*-Tree.
- K-Means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including K-Means++
- VA-File (static only, not dynamic databases); partial-VA to come for 0.5.0 final?
- Many popular cluster evaluation measures
- Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!)
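The K-Means options mentioned in the ELKI changelog (K-Means++ seeding followed by Lloyd style iterations) can be illustrated with a small sketch. This is not ELKI code (ELKI is a Java framework); it is a plain-Python illustration, and the function names `kmeans_pp_seeds` and `lloyd` are made up for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """K-Means++ seeding: the first centre is drawn uniformly, each further
    centre with probability proportional to its squared distance from the
    nearest centre chosen so far. Assumes at least k distinct points."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def lloyd(points, centres, iterations=10):
    """Lloyd style iterations: assign every point to its nearest centre,
    then move each centre to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centres = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centres)
        ]
    return centres


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = kmeans_pp_seeds(points, 2, random.Random(0))
centres = lloyd(points, seeds)
```

A MacQueen style variant would instead update the affected centre after each single point assignment rather than once per full pass.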

All entries

http://mloss.org/software/
The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM). It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art OCAS, Liblinear, LibSVM, SVMLight, SVMLin and GPDT. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels such as the Locality Improved, Fisher, TOP, Spectrum and Weighted Degree Kernel (with shifts). http://www.shogun-toolbox.org/

shogun | A Large Scale Machine Learning Toolbox

Orange - Data Mining Fruitful & Fun

Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. http://orange.biolab.si/
http://scikit-learn.org/stable/
Easy-to-use and general-purpose machine learning in Python
scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

machine learning in Python — scikit-learn v0.9 documentation

http://pybrain.org/pages/features
Videos
This video presentation was shown at the ICML Workshop for Open Source ML Software on June 25, 2010. It explains some of the features and algorithms of PyBrain and gives tutorials on how to install and use PyBrain for different tasks. We implemented many useful standard and advanced algorithms in PyBrain, and in some cases created interfaces to existing libraries (e.g. LIBSVM).

PyBrain

http://leon.bottou.org/projects/lasvm LASVM is an approximate SVM solver that uses online approximation. It reaches accuracies similar to those of a real SVM after performing a single sequential pass through the training examples. Further benefits can be achieved using selective sampling techniques to choose which example should be considered next. As shown in the graph, LASVM requires considerably less memory than a regular SVM solver. This becomes a considerable speed advantage for large training sets. In fact LASVM has been used to train a 10-class SVM classifier with 8 million examples on a single processor.

projects:lasvm [Léon Bottou]

Multiclass Support Vector Machine | GPU Computing

http://www.gpucomputing.net/?q=node/1281 The scaling of serial algorithms cannot rely on the improvement of CPUs anymore. The performance of classical Support Vector Machine (SVM) implementations has reached its limit, and the arrival of the multi-core era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high performance platforms to implement data parallel algorithms. This project describes how a naïve implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy.
Abstract — We propose a new algorithm for the incremental training of Support Vector Machines (SVMs) that is suitable for problems of sequentially arriving data and fast constraint parameter variation. Our method involves using a “warm-start” algorithm for the training of Support Vector Machines (SVMs), which allows us to take advantage of the natural incremental properties of the standard active set approach to linearly constrained optimisation problems. Incremental training involves quickly re-training a support vector machine after adding a small number of additional training vectors to the training set of an existing (trained) support vector machine. Similarly, the problem of fast constraint parameter variation involves quickly re-training an existing support vector machine using the same training set but different constraint parameters. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.126.2603

CiteSeerX — Incremental training of support vector machines

http://www.springerlink.com/content/788fq0p1wkjy63lt/

Abstract

The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain dependent approaches.
This paper presents a novel approach for detecting duplicate records in the context of digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative machine learning approaches designed for the task of classifying pairs of gazetteer records as either duplicates or not, built by using support vector machines or alternating decision trees with different combinations of similarity scores for the feature vectors. Experimental results show that using feature vectors that combine multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, leads to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections, through the usage of filtering or blocking techniques. This work was partially supported by the Fundação para a Ciência e a Tecnologia (FCT), through project grant PTDC/EIA-EIA/109840/2009 (SInteliGIS).
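Both abstracts build on scoring a pair of records from per-attribute similarities. The baseline named in the first abstract, a weighted aggregation of per-attribute distances compared against a threshold, might look like the sketch below. It is not the method of either paper: the field names, weights and the 0.85 threshold are invented for illustration, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string distance a real system would use.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two strings (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Weighted average of the per-attribute similarities.

    `weights` maps attribute name to its (hypothetical) importance."""
    total = sum(weights.values())
    score = sum(w * string_similarity(rec_a[field], rec_b[field])
                for field, w in weights.items())
    return score / total


def is_duplicate(rec_a, rec_b, weights, threshold=0.85):
    """Flag the pair as a potential duplicate if the score reaches the threshold."""
    return record_similarity(rec_a, rec_b, weights) >= threshold


# Illustrative records and weights (assumptions, not data from the papers).
weights = {"name": 2.0, "city": 1.0}
a = {"name": "John Smith", "city": "Lisbon"}
b = {"name": "Jon Smith", "city": "Lisbon"}
c = {"name": "xyz", "city": "uvw"}

near_match = is_duplicate(a, b, weights)
non_match = is_duplicate(a, c, weights)
```

Choosing the weights and the threshold by hand is exactly the step the first abstract proposes to avoid, and the second abstract replaces the fixed weighted sum with a trained classifier over the vector of similarity scores.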

Abstract

MILK: MACHINE LEARNING TOOLKIT — milk 0.3.7 documentation

Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
em is a package for creating Gaussian Mixture Models (diagonal and full covariance matrices supported), sampling from them, and estimating them from data using the Expectation Maximization algorithm. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess the number of clusters in the data. In the near future, I hope to add so-called online EM (i.e. recursive EM) and a variational Bayes implementation. em is implemented in Python, and uses the excellent numpy and scipy packages.
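To make the estimation step concrete, here is a minimal sketch of Expectation Maximization for a one-dimensional Gaussian mixture, together with the Bayesian Information Criterion. It does not use the em package or its API; the function names are hypothetical, the code is pure Python, and the BIC convention assumed here is "lower is better" with 3k - 1 free parameters for k components.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Run EM from the given initial parameters; returns new parameter lists."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against collapsing variance
    return means, variances, weights


def log_likelihood(data, means, variances, weights):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for m, v, w in zip(means, variances, weights)))
        for x in data
    )


def bic(data, means, variances, weights):
    """BIC = p * ln(n) - 2 * ln(L), with p = 3k - 1 free parameters."""
    n_params = 3 * len(means) - 1
    return (n_params * math.log(len(data))
            - 2 * log_likelihood(data, means, variances, weights))


data = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.0]
two = em_gmm_1d(data, [0.5, 8.0], [1.0, 1.0], [0.5, 0.5])
one = em_gmm_1d(data, [4.0], [10.0], [1.0])
bic_two = bic(data, *two)
bic_one = bic(data, *one)
```

On this toy data the two-component model gets the lower BIC, which is how the criterion is used to pick the number of clusters.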

Em

Python module for extended Infomax ICA

Independent Component Analysis (ICA) is a modern and effective method for performing blind source separation (also known as the Cocktail Party Problem). Fields of application are artifact reduction in multivariate data (e.g. EEG or MEG), finding hidden factors in financial data, or noise reduction in images. Furthermore, ICA can be used to simplify and improve the solution of the inverse source problem in EEG and MEG analysis. As I found no Python module for performing ICA, I wrapped the existing extended infomax implementation from EEGLAB.
For some large data sets, performance is similar with or without nonlinear mappings. Without using kernels, one can quickly train on a much larger set via a linear classifier. Document classification is one such application. In the following example (20,242 instances and 47,236 features; available on LIBSVM data sets), the cross-validation time is significantly reduced by using LIBLINEAR:

    % time libsvm-2.85/svm-train -c 4 -t 0 -e 0.1 -m 800 -v 5 rcv1_train.binary
    Cross Validation Accuracy = 96.8136%
    345.569s

    % time liblinear-1.21/train -c 4 -e 0.1 -v 5 rcv1_train.binary
    Cross Validation Accuracy = 97.0161%
    2.944s

Warning: While LIBLINEAR's default solver is very fast for document classification, it may be slow in other situations.

LIBLINEAR -- A Library for Large Linear Classification