
Cluster Analysis
R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there is no best solution to the problem of determining the number of clusters to extract, several approaches are given below.

Data Preparation
Prior to clustering, you may want to remove or estimate missing data and rescale variables for comparability.

# Prepare Data
mydata <- na.omit(mydata)  # listwise deletion of missing
mydata <- scale(mydata)    # standardize variables

Partitioning
K-means clustering is the most popular partitioning method.

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).

Hierarchical Agglomerative
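The excerpt breaks off at the heading above. As a hedged sketch of the standard base-R agglomerative workflow (my reconstruction, not necessarily the page's own code; mydata carries over from above, and the choice of five clusters is an assumption):

# Ward hierarchical clustering (sketch)
d <- dist(mydata, method = "euclidean")   # distance matrix
fit <- hclust(d, method = "ward.D2")      # agglomerative clustering
plot(fit)                                 # display dendrogram
groups <- cutree(fit, k = 5)              # cut tree into 5 clusters (assumed k)
rect.hclust(fit, k = 5, border = "red")   # draw boxes around the 5 clusters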

http://www.statmethods.net/advstats/cluster.html

Related: R Resources, books

Factor Analysis This section covers principal components and factor analysis. The latter includes both exploratory and confirmatory methods. Principal Components The princomp( ) function produces an unrotated principal component analysis.

The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users “Ten Simple Rules for Reproducible Computational Research” is a freely available paper in PLOS Computational Biology. As I am currently very interested in the subject of reproducible data analysis, I will review these ten rules and their possible implementation in R from my point of view as an epidemiologist interested in healthcare data reuse. I will also check whether my own workflow complies with these rules.
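A minimal sketch of the princomp( ) workflow (mydata is an assumed numeric data frame; cor = TRUE runs the analysis on the correlation matrix so variables on different scales are comparable):

# Unrotated principal components (sketch)
fit <- princomp(mydata, cor = TRUE)
summary(fit)               # variance accounted for by each component
loadings(fit)              # component loadings
plot(fit, type = "lines")  # scree plot
biplot(fit)                # observations and variables on the first two PCs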

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons Airbnb New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel. This was the first recruiting competition on Kaggle with scripts enabled, and Airbnb encouraged participants to prove their chops through collaboration and code sharing in addition to their final models. Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, RandomForest, and ExtraTreesClassifier models.

Self-Organising Maps for Customer Segmentation using R Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically two) dimensional representations. In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here use the 2011 Irish Census information for the greater Dublin area as an example data set. This work is based on a talk given to the Dublin R Users group in January 2014. If you are keen to get down to business:
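One common choice for SOMs in R is the kohonen package; a hedged sketch (not necessarily the post's own code; census_data and the grid size are placeholders):

# Self-organising map with the kohonen package (sketch)
library(kohonen)
data_matrix <- as.matrix(scale(census_data))   # SOMs expect a scaled numeric matrix
som_grid <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal")
som_model <- som(data_matrix, grid = som_grid, rlen = 100)
plot(som_model, type = "counts")               # how many cases map to each node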

Version Control, File Sharing, and Collaboration Using GitHub and RStudio This is Part 3 of our “Getting Started with R Programming” series. For previous articles in the series, click here: Part 1, Part 2. This week, we are going to talk about using git and GitHub with RStudio to manage your projects.

How To Perform A Logistic Regression In R Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is to predict y given a set of predictors x. The predictors can be continuous, categorical, or a mix of both. The categorical variable y can, in general, assume different values.

Comparison of String Distance Algorithms For the visualization of votes in the Bundestag, I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually.
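Two hedged sketches tied to the excerpts above: the standard base-R glm( ) call for logistic regression (mydata, y, x1, x2 are placeholders), and base R's adist( ) as one simple string-distance measure for matching misspelled names (not necessarily the measure the post settles on):

# Logistic regression with glm (sketch)
fit <- glm(y ~ x1 + x2, data = mydata, family = binomial(link = "logit"))
summary(fit)                     # coefficients and significance
predict(fit, type = "response")  # fitted probabilities

# Fuzzy name matching via generalized Levenshtein distance
adist("Schauble", "Schäuble")    # small distance flags a likely typo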

How to combine multiple CSV files into one using CMD This is a trick which can save you a lot of time when working with a dataset spread across multiple CSV files. Using a simple CMD command, it is possible to combine all the CSVs into a single entity ready for all your pivot and table wizardry. Step 1: save all of the CSV files into a single folder.

Variable Selection Procedures - The LASSO The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X. Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem whose objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection. This post features a toy example illustrating tactics in variable selection with the LASSO. The post also discusses the issue of consistency: how we know, from a large-sample perspective, that we are homing in on the true set of predictors when we apply the LASSO. My take is that a two-step approach is often best.
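Two hedged sketches tied to the excerpts above: an R alternative to the CMD trick (assuming every file shares the same header), and a LASSO fit with the glmnet package (not necessarily the post's own code; x must be a numeric matrix and y a vector, both placeholders; alpha = 1 selects the LASSO penalty):

# Combine all CSVs in the working directory (R alternative to the CMD trick)
files <- list.files(pattern = "\\.csv$")
combined <- do.call(rbind, lapply(files, read.csv))
write.csv(combined, "combined.csv", row.names = FALSE)

# LASSO with glmnet (sketch)
library(glmnet)
cv_fit <- cv.glmnet(x, y, alpha = 1)  # cross-validate the tuning parameter lambda
coef(cv_fit, s = "lambda.min")        # coefficients at the best lambda;
                                      # irrelevant predictors are driven to exactly 0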

Technical Tidbits From Spatial Analysis & Data Science Even the most experienced R users need help creating elegant graphics. The ggplot2 library is a phenomenal tool for creating graphics in R, but even after many years of near-daily use we still need to refer to our Cheat Sheet. Up until now, we’ve kept these key tidbits in a local PDF. But for our own benefit (and hopefully yours), we decided to post the most useful bits of code.

A wrapper around nested ifelse The ifelse function is the way to do vectorised if-then-else in R. It is one of the first cool things I learned to do in R a few years back, from Norman Matloff’s The Art of R Programming. When you have more than one if-then statement, you just nest multiple ifelse calls before you reach the final else:

set.seed(0310)
x <- runif(1000, 1, 20)
y <- runif(1000, 1, 20)
the_old_way <- ifelse(x < 5 & y < 5, 'A',
               ifelse(x < 5 & y < 15, 'B',
               ifelse(x < 5, 'C',
               ifelse(x < 15 & y < 5, 'D',
               ifelse(x < 15 & y < 15, 'E',
               ifelse(y < 5, 'F',
               ifelse(y < 15, 'G', 'H')))))))

Although this is very functional and fast, it is not exactly pretty.
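The post goes on to build such a wrapper; as a hedged alternative (not the post's own wrapper), dplyr's case_when( ) expresses the same mapping more readably, evaluating conditions in order with the first match winning:

# Same mapping with dplyr::case_when (sketch; x and y as defined above)
library(dplyr)
the_tidy_way <- case_when(
  x < 5  & y < 5  ~ "A",
  x < 5  & y < 15 ~ "B",
  x < 5           ~ "C",
  x < 15 & y < 5  ~ "D",
  x < 15 & y < 15 ~ "E",
  y < 5           ~ "F",
  y < 15          ~ "G",
  TRUE            ~ "H"
)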
