The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users
“Ten Simple Rules for Reproducible Computational Research” is a freely available paper in PLOS Computational Biology. As I’m currently very interested in the subject of reproducible data analysis, I will review these ten rules and their possible implementation in R from my point of view as an epidemiologist interested in healthcare data reuse. I will also check whether my workflow complies with these rules.

Factor Analysis
This section covers principal components and factor analysis. The latter includes both exploratory and confirmatory methods. Principal Components: The princomp() function produces an unrotated principal component analysis.
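
As a minimal sketch of the princomp() call described above: the built-in USArrests data set and the cor = TRUE setting are illustrative assumptions, not taken from the original article.

```r
# Unrotated PCA on the built-in USArrests data (an illustrative choice).
# cor = TRUE bases the analysis on the correlation matrix, so variables
# measured on different scales contribute equally.
fit <- princomp(USArrests, cor = TRUE)

summary(fit)       # proportion of variance explained by each component
loadings(fit)      # variable loadings on each component
head(fit$scores)   # component scores for the first few observations
```

With cor = TRUE the component variances (fit$sdev^2) sum to the number of variables, which is a quick sanity check on the fit.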

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons
AirBnB New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel. This was the first recruiting competition on Kaggle with scripts enabled. AirBnB encouraged participants to prove their chops through their collaboration and code sharing in addition to their final models. Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, a RandomForest, and an ExtraTreesClassifier.

Version Control, File Sharing, and Collaboration Using GitHub and RStudio
This is Part 3 of our “Getting Started with R Programming” series. For previous articles in the series, click here: Part 1, Part 2. This week, we are going to talk about using git and GitHub with RStudio to manage your projects.

Self-Organising Maps for Customer Segmentation using R
Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations. In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here use the 2011 Irish Census information for the greater Dublin area as an example data set. This work is based on a talk given to the Dublin R Users group in January 2014. If you are keen to get down to business:

How To Perform A Logistic Regression In R
Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values.
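
A minimal sketch of such a fit with base R's glm() might look like the following. It assumes the simplest case of a binary y, and the built-in mtcars data with am as the outcome and wt and hp as predictors are illustrative choices rather than the article's own example.

```r
# Binary outcome: am (0 = automatic, 1 = manual) from the built-in mtcars data.
# Predictors wt and hp are both continuous; this choice is only illustrative.
model <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

summary(model)

# Predicted probabilities that y = 1, then a classification at a 0.5 cut-off
prob <- predict(model, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
mean(pred == mtcars$am)   # in-sample accuracy
```

The 0.5 cut-off is a convention, not a requirement; a different threshold may suit a problem where the two kinds of error have different costs.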

How to combine multiple CSV files into one using CMD
This is a trick which can save you a lot of time when working with a dataset spread across multiple CSV files. Using a simple CMD command, it is possible to combine all the CSV files into a single entity ready for all your pivot and table wizardry. Step 1: Save all of the CSV files into a single folder.
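
The article itself uses a CMD command. For readers who work in R, the same idea can be sketched as below; this is an R analogue, not the article's method. It assumes every file shares the same columns, and the folder and file names are hypothetical.

```r
# Create two small CSV files in a temporary folder so the example is self-contained.
# The folder name and file names are hypothetical.
csv_dir <- file.path(tempdir(), "csv_parts")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(csv_dir, "part2.csv"), row.names = FALSE)

# List every CSV file in the folder, read each one, and stack the rows.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))

# Write the result outside the source folder so it is not re-read on a later run.
write.csv(combined, file.path(tempdir(), "combined.csv"), row.names = FALSE)
```

Unlike a plain file concatenation, read.csv() parses each header separately, so the header row is not repeated in the combined output.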

Comparison of String Distance Algorithms
To visualize votes in the Bundestag, I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually.

Variable Selection Procedures - The LASSO
The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X. Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection. This post features a toy example illustrating tactics in variable selection with the LASSO. The post also discusses the issue of consistency: how we know, from a large-sample perspective, that we are homing in on the true set of predictors when we apply the LASSO. My take is that a two-step approach is often best.
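
To make the shrink-to-zero behaviour concrete, here is a minimal base R sketch. It is not the post's toy example or its two-step approach, and it swaps the quadratic programming formulation for cyclic coordinate descent with soft-thresholding, another common way to solve the same penalized problem. The function names, the simulated data, and the value of λ are all assumptions made for illustration.

```r
# Soft-thresholding operator: shrinks z towards zero by gamma.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Minimise (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# by cyclic coordinate descent. X is standardised and y is centred first,
# so the returned coefficients are on the standardised scale.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)
  y <- y - mean(y)
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual with predictor j left out of the fit
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      z_j <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(z_j, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

# Simulated data: only the first two of six candidate predictors affect y.
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(200)

b <- lasso_cd(X, y, lambda = 0.5)
round(b, 2)
```

In this setup the coefficients of the four irrelevant predictors should be driven to exactly zero, while the two true predictors keep nonzero (but shrunken) coefficients. Raising λ shrinks more aggressively; lowering it lets more predictors in.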

A wrapper around nested ifelse
The ifelse function is the way to do vectorised if-then-else in R. One of the first cool things I learned to do in R a few years back, I got from Norman Matloff’s The Art of R Programming. When you have more than one if-then statement, you just nest multiple ifelse functions before you reach the else.
    set.seed(0310)
    x <- runif(1000, 1, 20)
    y <- runif(1000, 1, 20)
    the_old_way <- ifelse(x < 5 & y < 5, 'A',
                   ifelse(x < 5 & y < 15, 'B',
                   ifelse(x < 5, 'C',
                   ifelse(x < 15 & y < 5, 'D',
                   ifelse(x < 15 & y < 15, 'E',
                   ifelse(y < 5, 'F',
                   ifelse(y < 15, 'G', 'H')))))))
Although this is very functional and fast, it is not exactly pretty.
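
One possible shape for such a wrapper is sketched below. The excerpt does not show the author's actual wrapper, so the helper name if_chain and its calling convention (alternating condition/value pairs followed by a default) are hypothetical.

```r
# Hypothetical helper: takes alternating condition/value arguments followed
# by a final default, and applies them in order like a nested ifelse.
if_chain <- function(...) {
  args <- list(...)
  n <- length(args)
  stopifnot(n >= 3, n %% 2 == 1)
  result <- args[[n]]                 # the final "else" value
  # Work backwards so that earlier conditions take precedence.
  for (i in seq(n - 2, 1, by = -2)) {
    result <- ifelse(args[[i]], args[[i + 1]], result)
  }
  result
}

set.seed(0310)
x <- runif(1000, 1, 20)
y <- runif(1000, 1, 20)

the_new_way <- if_chain(
  x < 5  & y < 5,  'A',
  x < 5  & y < 15, 'B',
  x < 5,           'C',
  x < 15 & y < 5,  'D',
  x < 15 & y < 15, 'E',
  y < 5,           'F',
  y < 15,          'G',
  'H'
)
table(the_new_way)
```

The flat list of pairs reads top to bottom in the same priority order as the nested version, without the trailing run of closing parentheses. All arguments are evaluated up front, so this sketch trades a little speed for readability.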

Technical Tidbits From Spatial Analysis & Data Science
Even the most experienced R users need help creating elegant graphics. The ggplot2 library is a phenomenal tool for creating graphics in R but even after many years of near-daily use we still need to refer to our Cheat Sheet. Up until now, we’ve kept these key tidbits on a local PDF. But for our own benefit (and hopefully yours) we decided to post the most useful bits of code.