
Machine Learning


Machine Learning with R: A Complete Guide to Decision Trees - Appsilon. Decision Trees with R. Decision trees are among the most fundamental algorithms in supervised machine learning, used to handle both regression and classification tasks.


Support Vector Regression Tutorial for Machine Learning - Analytics Vidhya.

Julia Silge - Tuning random forest hyperparameters with #TidyTuesday trees data. I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models.


Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model. Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Explore the data: Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset.

This isn’t this week’s dataset, but it’s one I have been wanting to return to.

Moodle. The Support Vector Machine algorithm is effective for balanced classification, although it does not perform well on imbalanced datasets.


The SVM algorithm finds a hyperplane decision boundary that best splits the examples into two classes. The split is made soft through the use of a margin that allows some points to be misclassified. By default, this margin favors the majority class on imbalanced datasets, although it can be updated to take the importance of each class into account and dramatically improve the performance of the algorithm on datasets with skewed class distributions. This modification of SVM, which weights the margin in proportion to the class importance, is often referred to as weighted SVM, or cost-sensitive SVM. In this tutorial, you will discover weighted support vector machines for imbalanced classification.
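The class weighting described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tutorial's own code. It assumes labels in {-1, +1} and uses the common "balanced" heuristic n_samples / (n_classes * class_count) for the weights. The function names `balanced_class_weights` and `weighted_hinge_loss` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_hinge_loss(scores, labels, weights, C=1.0):
    """Sum of hinge losses, each scaled by C times the weight of its true class.

    scores: decision-function values f(x); labels: -1 or +1.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        margin_violation = max(0.0, 1.0 - label * score)
        total += C * weights[label] * margin_violation
    return total


# Hypothetical skewed dataset: 8 majority (-1) examples, 2 minority (+1) examples.
labels = [-1] * 8 + [1] * 2
weights = balanced_class_weights(labels)

# The same margin violation (0.5) costs more on the minority class.
minority_cost = weighted_hinge_loss([0.5], [1], weights)
majority_cost = weighted_hinge_loss([-0.5], [-1], weights)
```

With these weights, a margin violation on a minority example is penalized four times as heavily as the same violation on a majority example, which is what pushes the boundary away from the minority class.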

After completing this tutorial, you will know: Let’s get started.

Auditor: a guided tour through residuals. Machine learning is a hot topic nowadays, thus there is no need to convince anyone about its usefulness.


ML models are being successfully applied in biology, medicine, finance, and so on. Thanks to modern software, it is easy to train even a complex model that fits the training data and results in high accuracy on the test set. The problem arises when a poorly verified model fails when confronted with real-world data. In this post, we would like to describe the auditor package for visual auditing of residuals of machine learning models.

A residual is the difference between the observed value and the value predicted by a model. Does the model fit the data?

TensorFlow for R: So, how come we can use TensorFlow from R? Which computer language is most closely associated with TensorFlow?


While on the TensorFlow for R blog, we would of course like the answer to be R, chances are it is Python (though TensorFlow has official bindings for C++, Swift, JavaScript, Java, and Go as well). So why is it you can define a Keras model as

    library(keras)
    model <- keras_model_sequential() %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 1)

(nice with %>%s and all!), then train and evaluate it, get predictions and plot them, all that without ever leaving R? The short answer is, you have keras, tensorflow and reticulate installed. reticulate embeds a Python session within the R process. This post first elaborates a bit on the short answer.

One note on terminology before we jump in: On the R side, we’re making a clear distinction between the packages keras and tensorflow. So keras, tensorflow, reticulate, what are they for?

Introducing mlrPlayground. First of all, you may ask yourself how this name ‘mlrPlayground’ is even justified.


What kind of person dares to put two such opposite terms in a single word and expects people to take him seriously? I assume most of you know ‘mlr’; for those who don’t: it is a framework offering a huge variety of tools for simplifying machine learning tasks in R.

Patrick Schratz. The mlr-org team is very proud to present the initial release of the mlr3 machine-learning framework for R. mlr3 comes with a clean object-oriented design using the R6 class system.


With this, it overcomes the limitations of R’s S3 classes.

A Gentle Introduction to tidymodels. By Edgar Ruiz. Recently, I had the opportunity to showcase tidymodels in workshops and talks.


Because of my vantage point as a user, I figured it would be valuable to share what I have learned so far. Let’s begin by framing where tidymodels fits in our analysis projects. The diagram above is based on the R for Data Science book, by Wickham and Grolemund. The version in this article illustrates what step each package covers.

iBreakDown plots for Sinking of the RMS Titanic.

DALEX for keras and parsnip. DALEX is a set of tools for explanation, exploration and debugging of predictive models.


The nice thing about it is that it can be easily connected to different model factories. Recently Michal Maj wrote a nice vignette on how to use DALEX with models created in keras (an open-source neural-network library in Python with an R interface created by RStudio). Find the vignette here. Michal compared a keras model against deeplearning from the h2o package, so you can check which model won on the Titanic dataset. The next nice vignette was created by Szymon Maksymiuk.

Gradient Boosting

shapper is on CRAN, it’s an R wrapper over SHAP explainer for black-box models. Written by: Alicja Gosiewska. In applied machine learning, there are opinions that we need to choose between interpretability and accuracy.


However, in the field of Interpretable Machine Learning, there are more and more new ideas for explaining black-box models. One of the best-known methods for local explanations is SHapley Additive exPlanations (SHAP). The SHAP method is used to calculate the influence of variables on a particular observation. This method is based on Shapley values, a technique borrowed from game theory. The R package shapper is a port of the Python library shap; there are also pure R implementations of the SHAP method, e.g. iml or shapleyR. Installation: shapper wraps the Python library, therefore installation requires a bit more effort than installation of an ordinary R package. Install the R package shapper.
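To make the Shapley values behind SHAP concrete, here is a minimal sketch in plain Python that computes them exactly for a tiny toy model. It does not use shapper or shap. The toy model, the instance `X`, the zero baseline for "absent" features, and the names `shapley_values` and `toy_value` are all assumptions made for illustration. Exact enumeration like this is only feasible for a handful of features.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = Fraction(0)
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical observation to explain.
X = {"x1": 1, "x2": 2, "x3": 4}


def toy_value(present):
    """Toy model f = 2*x1 + 3*x2 + x1*x3, with absent features set to 0."""
    x = {name: (val if name in present else 0) for name, val in X.items()}
    return 2 * x["x1"] + 3 * x["x2"] + x["x1"] * x["x3"]


phi = shapley_values(list(X), toy_value)
```

In this toy case the interaction term x1*x3 is split evenly between x1 and x3, and the contributions add up to the difference between the full prediction and the baseline prediction.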


A tutorial on tidy cross-validation with R - Econometrics and Free Software. Set up: Let’s load the needed packages:

    library("tidyverse")
    library("tidymodels")
    library("parsnip")
    library("brotools")
    library("mlbench")

Load the data, included in the {mlbench} package:

The tidy caret interface in R - poissonisfish. Among the most popular off-the-shelf machine learning packages available to R, caret ought to stand out for its consistency. It reaches out to a wide range of dependencies that deploy and support model building using a uniform, simple syntax. I have been using caret extensively for the past three years, with a precious partial least squares (PLS) tutorial in my track record. A couple of years ago, the creator and maintainer of caret, Max Kuhn, joined RStudio, where he has been contributing new packages to the ongoing tidy-paranoia: the supporting recipes, yardstick, rsample and many other packages that are part of the tidyverse paradigm and that I knew little about.

As it happens, caret is now best used with some of these. As an aspiring data scientist with fashionable hex-stickers on my laptop cover and a tendency to start any sentence with ‘big data’, I set out to learn tidyverse and go Super Mario using pipes (%>%, Ctrl + Shift + M). Overall, I found the ‘tidy’ approach quite enjoyable and efficient.

Visualize the Business Value of your Predictive Models with modelplotr. Why ROC curves are a bad idea to explain your model to business people. Summary: In this blog we explain the four most valuable evaluation plots to assess the business value of a predictive model. These plots are the cumulative gains, cumulative lift, response and cumulative response.

ModelDown: a website generator for your predictive models. I love the pkgdown package.

Anomaly Detection

How to implement Random Forests in R. Imagine you were to buy a car, would you just go to a store and buy the first one that you see? No, right? You usually consult a few people around you, take their opinions, add your research to it and then go for the final decision. Let’s take a simpler scenario: whenever you go for a movie, do you ask your friends for reviews about the movie (unless, of course, it stars one of your favorite actresses)?

GitHub - mljar/mljar-api-R: R wrapper for MLJAR API.

The one function call you need to know as a data scientist: h2o.automl. Introduction: Two things that recently came to my attention were AutoML (Automatic Machine Learning) by and the fashion MNIST by Zalando Research.

Radial kernel Support Vector Classifier.

Random Forests in R.


Easy Cross Validation in R with `modelr` · I'm Jacob.

Ensembles Of ML Algos.

Observation and Performance Window - Listen Data. The first step of building a predictive model is to define a target variable. For that we need to define the observation and performance window. Observation Window: It is the period from which the independent variables (predictors) come. In other words, the independent variables are created considering this period (window) only. Performance Window: It is the period from which the dependent variable (target) comes.

Practicing Machine Learning Techniques in R with MLR Package.

Using caret to compare models.

Cross-Validation for Predictive Analytics Using R - MilanoR.

Implementation of 19 Regression algorithms in R using CPU performance data. - Data Science-Zing.

Feature Selection

Yet Another Blog in Statistical Computing.

Vik's Blog - Writings on machine learning, data science, and other cool stuff. Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples (with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set.

These predictions are then averaged to create a, hopefully more accurate, final prediction value. One can quickly intuit that this technique will be more useful when the predictors are more unstable. In other words, if the random samples that you draw from your training set are very different, they will generally lead to very different sets of predictions. This greater variability will lead to a stronger final result. When the samples are extremely similar, all of the predictions derived from the samples will likewise be extremely similar, making bagging a bit superfluous. Okay, enough theoretical framework.
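The bagging procedure described above can be sketched as follows. This is a minimal plain-Python illustration, not the blog's own code. It assumes a one-split regression stump as the (deliberately unstable) base model, and the names `fit_stump`, `predict_stump` and `bagged_predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical in this sample: predict the overall mean.
        overall = mean(ys)
        return (pairs[0][0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagged_predict(train_x, train_y, test_x, n_models=25, seed=0):
    """Average the predictions of stumps fit on bootstrap samples of the training set."""
    rng = random.Random(seed)
    n = len(train_x)
    all_predictions = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stump = fit_stump([train_x[i] for i in idx], [train_y[i] for i in idx])
        all_predictions.append([predict_stump(stump, x) for x in test_x])
    # Average the per-model predictions for each test point.
    return [mean(column) for column in zip(*all_predictions)]


# Hypothetical step-shaped data: low targets for small x, high targets for large x.
train_x = list(range(1, 11))
train_y = [0] * 5 + [10] * 5
predictions = bagged_predict(train_x, train_y, test_x=[0, 11])
```

Each bootstrap sample can place the split at a different threshold, so the averaged prediction is smoother than that of any single stump.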

Confidence Intervals for Random Forests.

Compare The Performance of Machine Learning Algorithms in R. How do you compare the estimated accuracy of different machine learning algorithms effectively?

Predicting wine quality using Random Forests.

Kickin’ it with elastic net regression - On the lambda. With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful.

Self-Organising Maps for Customer Segmentation using R.

Using C4.5 to predict Diabetes in Pima Indian Women.

Down-Sampling Using Random Forests - Applied Predictive Modeling.