
Cluster Analysis

R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there is no single best solution to the problem of determining the number of clusters to extract, several approaches are given below.

Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.

    # Prepare Data
    mydata <- na.omit(mydata) # listwise deletion of missing
    mydata <- scale(mydata) # standardize variables

Partitioning

K-means clustering is the most popular partitioning method.

    # Determine number of clusters
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
    plot(1:15, wss, type="b", xlab="Number of Clusters",
      ylab="Within groups sum of squares")

A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).

Hierarchical Agglomerative
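The excerpt stops at the hierarchical agglomerative heading. As a minimal sketch of what that approach can look like in base R, the block below builds a distance matrix, runs hclust( ) with Ward's linkage, and cuts the tree into groups. The built-in USArrests data, the choice of "ward.D2", and k = 5 are assumptions made for illustration and are not taken from the original article.

```r
# Sketch only: USArrests is a built-in dataset standing in for mydata
mydata <- scale(na.omit(USArrests))        # same preparation as above

d   <- dist(mydata, method = "euclidean")  # pairwise distance matrix
fit <- hclust(d, method = "ward.D2")       # Ward's minimum-variance linkage

groups <- cutree(fit, k = 5)               # cut the tree into 5 clusters (k is arbitrary here)
table(groups)                              # cluster sizes

plot(fit)                                  # dendrogram
rect.hclust(fit, k = 5, border = "red")    # outline the 5 clusters
```

The number of clusters is supplied after the tree is built, so different values of k can be tried by calling cutree( ) again without refitting.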

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons

AirBnB New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel. This was the first recruiting competition on Kaggle with scripts enabled. AirBnB encouraged participants to prove their chops through their collaboration and code sharing in addition to their final models. Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, a RandomForest, and an ExtraTreesClassifier.

The Basics

What was your background prior to entering this challenge?
I currently work as a postdoctoral researcher at the NeuroInformatics Laboratory, FBK in Trento, Italy.

How did you get started competing on Kaggle?
I first heard about Kaggle around three years ago, when a colleague showed me the website.

What made you decide to enter this competition?

Let's Get Technical

What preprocessing and supervised learning methods did you use?

Fig. 1

Feature Engineering:

How To Perform A Logistic Regression In R

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values.

Logistic regression implementation in R

R makes it very easy to fit a logistic regression model.

The dataset

We’ll be working on the Titanic dataset.

The data cleaning process

When working with a real dataset we need to take into account the fact that some data might be missing or corrupted; therefore we need to prepare the dataset for our analysis.

    <- read.csv('train.csv',header=T,na.strings=c(""))

Now we need to check for missing values and see how many unique values there are for each variable, using the sapply() function, which applies the function passed as an argument to each column of the data frame.

    data <- subset(,select=c(2,3,5,6,7,8,10,12))
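The steps described above (inspecting columns with sapply(), then fitting the model) can be sketched end to end as follows. Because train.csv is not available here, the built-in mtcars data stands in for the Titanic data. The data frame name df, the chosen columns, and the 0.5 cutoff are all assumptions for illustration, not the original post's code.

```r
# Sketch only: mtcars (built-in) stands in for the Titanic train.csv,
# and the variable names below are illustrative assumptions.
df <- mtcars[, c("am", "wt", "hp")]

# Count missing and unique values per column, as described above
sapply(df, function(x) sum(is.na(x)))
sapply(df, function(x) length(unique(x)))

# Fit a logistic regression: am (0/1) as a function of weight and horsepower
model <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = df)
summary(model)

# Predicted probabilities and a 0.5 classification threshold
p    <- predict(model, newdata = df, type = "response")
pred <- ifelse(p > 0.5, 1, 0)
mean(pred == df$am)   # in-sample accuracy, optimistic compared with held-out data
```

Accuracy measured on the training rows overstates real performance; with the Titanic data one would normally hold out a test split before fitting.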

Variable Selection Procedures - The LASSO | Business Forecasting

The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X. Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection.

This post features a toy example illustrating tactics in variable selection with the lasso. The post also discusses the issue of consistency – how we know from a large sample perspective that we are homing in on the true set of predictors when we apply the LASSO. My take is that a two-step approach is often best.

Toy Example

The following table illustrates something of the power of the lasso.

Real World Examples
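To make the shrinkage mechanism described above concrete, here is a small base-R sketch on simulated data. It is not the post's toy example, and it does not use the quadratic programming formulation mentioned above: it uses coordinate descent with soft-thresholding, a common alternative way to minimize the same penalized objective. The simulated data, the function names soft_threshold and lasso_cd, and the λ values are all assumptions for illustration. In practice a package such as glmnet would be the usual choice.

```r
# Sketch only: simulated data and all names here are illustrative assumptions.
set.seed(1)
n <- 200; p <- 8
X <- scale(matrix(rnorm(n * p), n, p))       # standardized candidate predictors
beta_true <- c(3, -2, 0, 0, 1.5, 0, 0, 0)    # only three predictors matter
y <- as.vector(X %*% beta_true + rnorm(n))
y <- y - mean(y)                             # center the target so no intercept is needed

# Soft-thresholding operator: shrink z toward zero by g, clipping at zero
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimizes (1/(2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]  # partial residual without predictor j
      z    <- sum(X[, j] * r_j) / n
      b[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  b
}

b_small <- lasso_cd(X, y, lambda = 0.01)  # small penalty: close to least squares
b_big   <- lasso_cd(X, y, lambda = 0.5)   # larger penalty: weak coefficients set exactly to zero
round(b_small, 2)
round(b_big, 2)
```

Comparing the two fits shows the trade-off behind λ: the larger penalty removes the irrelevant predictors but also shrinks the coefficients that are kept. That shrinkage is one reason a two-step approach (select with the lasso, then refit) is often suggested.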

Audience Segmentation - Giving Clicks a Personality | Targeting & Segmentation

Times have changed. My uncle is in his seventies, uses Facebook more than I do and is working as an extra in films and television commercials. My mother has retired and was discussing her modem with me last week. Just ten years ago, things were different. Now, however, folks on the internet are a lot more assured in their expectations from online experiences. Attribution is becoming a much more difficult craft with the advent of varying devices and sharing tools, so it is ever more important to treat users as an interactive audience rather than as numbers. Below, you can see a few simple examples of ways in which an audience can begin to be identified.

Identify the age and gender of your audience
Use this information to rethink ad targeting, site design, "voice" etc.

Find out what interests your audience
Use this information to learn more about your qualified visitors and what makes them tick.

Learn more about the audience you interact with and that of your site

High speed trading swimming - Marginal REVOLUTION

Next year the innovative swimming suits that are causing world records to fall at rapid pace will be banned. Michael Mandel wonders if this is the beginning of the counterrevolution against technological progress and Tyler argues “essentially on innovation we’re seeing a flipping of the burden of proof and I don’t think it is possible to easily fine-tune that flipping in a way to capture good innovations and rule out bad ones.” Believe it or not, Mandel really was talking about swimsuits. High-tech swimming suits and trading systems are primarily about distribution not efficiency. One difference between swimsuits and trading systems is that the former are regulated by FINA, the federation that administers international competition in aquatic sports. NASDAQ (and the other exchanges) are the logical equivalent to NASCAR and FINA in that they can internalize the externalities among the primary players. But would exchange regulation go far enough?