
Machine Learning


Standardisation

Easy Cross Validation in R with `modelr` · I'm Jacob. Ensembles Of ML Algos. Observation and Performance Window - Listen Data. The first step of building a predictive model is to define a target variable.

Observation and Performance Window - Listen Data

For that we need to define the observation and performance windows.

Observation Window: the period from which the independent variables (predictors) come. In other words, the independent variables are created considering this period (window) only.

Performance Window: the period from which the dependent variable (target) comes.

vtreat 0.5.27 released on CRAN – Win-Vector Blog. By John Mount. Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN. vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

vtreat 0.5.27 released on CRAN – Win-Vector Blog

(from the package documentation) Very roughly vtreat accepts an arbitrary “from the wild” data frame (with different column types, NAs, NaNs and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric free of NA, NaNs, infinities, and so on) ready for predictive modeling. We suggest any users please update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).

Practicing Machine Learning Techniques in R with MLR Package. Introduction In R, we often use multiple packages for doing various machine learning tasks.

Practicing Machine Learning Techniques in R with MLR Package

For example, we impute missing values using one package, then build a model with another, and finally evaluate its performance using a third. The problem is that every package has its own set of specific parameters. While working with many packages, we end up spending a lot of time figuring out which parameters are important. Don’t you think? To solve this problem, I researched and came across an R package named MLR, which is absolutely incredible at performing machine learning tasks.

Using caret to compare models. By Joseph Rickert. The model table on the caret package website lists more than 200 variations of predictive analytics models that are available within the caret framework.

Using caret to compare models

All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as a data science tool. For the past year or so xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code in the post compares gbm with xgboost using the segmentationData set that comes with caret.

Cross-Validation for Predictive Analytics Using R - MilanoR. Implementation of 19 Regression algorithms in R using CPU performance data. - Data Science-Zing.

Feature Selection

Yet Another Blog in Statistical Computing. Vik's Blog - Writings on machine learning, data science, and other cool stuff. Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples (with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set.

Vik's Blog - Writings on machine learning, data science, and other cool stuff

These predictions are then averaged to create a, hopefully more accurate, final prediction value. One can quickly intuit that this technique will be more useful when the predictors are more unstable. In other words, if the random samples that you draw from your training set are very different, they will generally lead to very different sets of predictions. This greater variability will lead to a stronger final result. When the samples are extremely similar, all of the predictions derived from the samples will likewise be extremely similar, making bagging a bit superfluous. Okay, enough theoretical framework.
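The bagging procedure described above can be sketched in a few lines. This is a minimal illustration in Python rather than R, using only the standard library. The base model (a one-variable least-squares line) and the names `fit_line` and `bagged_predict` are assumptions made for the example and do not come from the post.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; falls back to a constant if x has no spread."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def bagged_predict(train_x, train_y, test_x, n_models=50, seed=0):
    """Fit one model per bootstrap sample and average the test-set predictions."""
    rng = random.Random(seed)
    n = len(train_x)
    all_preds = []
    for _ in range(n_models):
        # Bootstrap: draw n row indices with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([train_x[i] for i in idx], [train_y[i] for i in idx])
        all_preds.append([a + b * x for x in test_x])
    # Average across models, one value per test point.
    return [mean(col) for col in zip(*all_preds)]
```

A straight-line fit is a fairly stable model, so the bagged result will usually sit close to a single fit on the full data. As the text notes, the averaging matters more for unstable base models such as unpruned trees.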

Confidence Intervals for Random Forests. By Joseph Rickert Random Forests, the "go to" classifier for many data scientists, is a fairly complex algorithm with many moving parts that introduces randomness at different levels.

Confidence Intervals for Random Forests

Understanding exactly how the algorithm operates requires some work, and assessing how well a Random Forests model fits the data is a serious challenge. In the pragmatic world of machine learning and data science, assessing model performance often comes down to calculating the area under the ROC curve (or some other convenient measure) on a hold out set of test data. If the ROC looks good then the model is good to go. Fortunately, however, goodness of fit issues have a kind of nagging persistence that just won't leave statisticians alone.

Compare The Performance of Machine Learning Algorithms in R.

How do you compare the estimated accuracy of different machine learning algorithms effectively?

Compare The Performance of Machine Learning Algorithms in R

In this post you will discover 8 techniques that you can use to compare machine learning algorithms in R. You can use these techniques to choose the most accurate model, and be able to comment on the statistical significance and the absolute amount by which it beat out other algorithms. Let’s get started.

Choose The Best Machine Learning Model How do you choose the best model for your problem? Predicting wine quality using Random Forests. Anomaly Detection in R. Introduction Inspired by this Netflix post, I decided to write a post based on this topic using R.

Anomaly Detection in R

There are several nice packages for this; the one we're going to review is AnomalyDetection.

Normal vs. Abnormal. An abnormal element, or outlier, is one that does not follow the behaviour of the majority. Data has noise, much like a radio with a poor signal that leaves you listening to background noise. The orange section could be noise in the data, since it oscillates around a value without showing a defined pattern; in other words, white noise. Are the red circles noise, or are they peaks from a hidden pattern? A good algorithm can detect abnormal points while accounting for the inherent noise and setting it aside.

Hands on anomaly detection! In this example, the data comes from Wikipedia, which offers an API to download from R the daily page views for any {term + language}.

Kickin’ it with elastic net regression – On the lambda. With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful.

Kickin’ it with elastic net regression – On the lambda

Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the squared L2 norm (Euclidean length) of the coefficient vector, which results in "shrinking" the beta coefficients. The aggressiveness of the penalty is controlled by a parameter λ. Lasso regression is a related regularization method. Instead of using the L2 norm, though, it penalizes the L1 norm (Manhattan distance) of the coefficient vector.
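To make the two penalties concrete, here is a small sketch in Python (standard library only) that computes them for a coefficient vector. The mixing formula in `elastic_net_penalty` follows the convention used by the glmnet package, λ·(α·‖β‖₁ + (1−α)/2·‖β‖₂²). That convention is an assumption here, since the excerpt above does not state a formula, and the function names are hypothetical.

```python
def l1_norm(beta):
    """Sum of absolute coefficients: the quantity the lasso penalizes."""
    return sum(abs(b) for b in beta)

def l2_norm_squared(beta):
    """Sum of squared coefficients: the quantity ridge regression penalizes."""
    return sum(b * b for b in beta)

def elastic_net_penalty(beta, lam, alpha):
    """Blend of both penalties (glmnet-style convention, assumed here).

    alpha = 1 gives the pure lasso penalty; alpha = 0 gives the ridge
    penalty with the conventional factor of 1/2.
    """
    return lam * (alpha * l1_norm(beta) + (1 - alpha) / 2 * l2_norm_squared(beta))
```

For a coefficient vector such as `[3.0, -4.0]`, the L1 norm is 7 and the squared L2 norm is 25, so large coefficients are punished much harder by the ridge term than by the lasso term.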

Self-Organising Maps for Customer Segmentation using R. Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations.

Self-Organising Maps for Customer Segmentation using R

In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here use the 2011 Irish Census information for the greater Dublin area as an example data set.

Using C4.5 to predict Diabetes in Pima Indian Women. C4.5 is an algorithm, developed by Ross Quinlan in 1993, used to generate a decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm.

Down-Sampling Using Random Forests — Applied Predictive Modeling. The down-side to down-sampling is that information in the majority classes is being thrown away and this situation becomes more acute as the class imbalance becomes more severe. Random forest models have the ability to use down-sampling without data loss. Recall that random forests is a tree ensemble method. A large number of bootstrap samples are taken from the training data and a separate unpruned tree is created for each data set.

This model contains another feature that randomly samples a subset of predictors at each split to encourage diversity of the resulting trees. When predicting a new sample, a prediction is produced by every tree in the forest and these results are combined to generate a single prediction for an individual sample. Random forests (and bagging) use bootstrap sampling. To incorporate down-sampling, random forest can take a random sample of size c*nmin, where c is the number of classes and nmin is the number of samples in the minority class.
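The per-tree sampling step can be sketched as follows, in Python with only the standard library. The excerpt gives the total sample size c*nmin but not how it is split, so drawing nmin rows from each class, with replacement, is an assumption made here. The function name `downsampled_bootstrap` is hypothetical.

```python
import random
from collections import defaultdict

def downsampled_bootstrap(labels, rng):
    """Return row indices for one tree: n_min draws from each class.

    Assumption: the c * n_min sample is split equally across the c classes
    and drawn with replacement, as in an ordinary bootstrap.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # n_min is the number of rows in the minority class.
    n_min = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        sample.extend(rng.choice(rows) for _ in range(n_min))
    return sample
```

Each tree sees a balanced sample, but a fresh draw is taken for every tree. Over a large forest most majority-class rows are used by at least one tree, which is why the text says no data is lost.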