background preloader

Cluster Analysis

Cluster Analysis
R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best solutions for the problem of determining the number of clusters to extract, several approaches are given below. Data Preparation Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability. # Prepare Data mydata <- na.omit(mydata) # listwise deletion of missing mydata <- scale(mydata) # standardize variables Partitioning K-means clustering is the most popular partitioning method. # Determine number of clusters wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss) plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares") A robust version of K-means based on mediods can be invoked by using pam( ) instead of kmeans( ). Hierarchical Agglomerative

Related:  RR Resourcesbooks

Factor Analysis This section covers principal components and factor analysis. The later includes both exploratory and confirmatory methods. Principal Components The princomp( ) function produces an unrotated principal component analysis. The PValues Data Table The PValues data table contains a row for each pair of Y and X variables. If you specified a column for Group, the PValues data table contains a first column called Group. A row appears for each level of the Group column and for each pair of Y and X variables. The PValues data table also contains a table variable called Original Data that gives the name of the data table that was used for the analysis. If you specified a By variable, JMP creates a PValues table for each level of the By variable, and the Original Data variable gives the By variable and its level.

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons AirBnB New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel. This was the first recruiting competition on Kaggle with scripts enabled. AirBnB encouraged participants to prove their chops through their collaboration and code sharing in addition to their final models. Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, a RandomForest, and an ExtraTreesClassifier.

Self-Organising Maps for Customer Segmentation using R Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations. In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here used use the 2011 Irish Census information for the greater Dublin area as an example data set. This work is based on a talk given to the Dublin R Users group in January 2014. If you are keen to get down to business:

Books I like If you’re serious about learning, you probably need to read a book at some point. These days if you want to learn applied statistics and data science tools, you have amazing options in the form of blogs, Q&A sites, and massive open online courses and even videos on You Tube. Wikipedia is also an amazing reference resource on statistics. I use all those things to learn new techniques and understand old ones better, but I also love reading books. No, I’m not one of those sentimental people who go on about the texture of paper; while I do like the look and feel of a “real” book and love having them around, it’s a pain the way the take up space, and the big majority of books I buy these days are on an e-reader. So when I say I like “books”, it’s something the depth and the focus that comes with a good book going deeply into a matter.

How To Perform A Logistic Regression In R Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values. Comparison of String Distance Algorithms For the visualization of votings in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable I just took care of it manually.

Two meanings of priors, part I: The plausibility of models by Angelika Stefan & Felix Schönbrodt When reading about Bayesian statistics, you regularly come across terms like “objective priors“, “prior odds”, “prior distribution”, and “normal prior”. However, it may not be intuitively clear that the meaning of “prior” differs in these terms. In fact, there are two meanings of “prior” in the context of Bayesian statistics: (a) prior plausibilities of models, and (b) the quantification of uncertainty about model parameters. As this often leads to confusion for novices in Bayesian statistics, we want to explain these two meanings of priors in the next two blog posts*. The current blog post covers the the first meaning of priors.

Variable Selection Procedures - The LASSO The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X. Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection. This post features a toy example illustrating tactics in variable selection with the lasso. The post also dicusses the issue of consistency – how we know from a large sample perspective that we are honing in on the true set of predictors when we apply the LASSO. My take is a two-step approach is often best.

Technical Tidbits From Spatial Analysis & Data Science Even the most experienced R users need help creating elegant graphics. The ggplot2 library is a phenomenal tool for creating graphics in R but even after many years of near-daily use we still need to refer to our Cheat Sheet. Up until now, we’ve kept these key tidbits on a local PDF. But for our own benefit (and hopefully yours) we decided to post the most useful bits of code. The Meeting Point Locator Hi Hillary, It’s Donald, would you like to have a beer with me in La Cabra Brewing, in Berwyn, Pensilvania? (Hypothetical utilization of The Meeting Point Locator) Finding a place to have a drink with someone may become a difficult task. It is quite common that one of them does not want to move to the other’s territory. I am sure you have faced to this situation many times.