# Cluster Analysis

R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there is no single best solution to the problem of determining the number of clusters to extract, several approaches are given below.

Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.

    # Prepare Data
    mydata <- na.omit(mydata) # listwise deletion of missing
    mydata <- scale(mydata) # standardize variables

Partitioning

K-means clustering is the most popular partitioning method. A common way to choose the number of clusters is to plot the within-groups sum of squares against the number of clusters and look for a bend in the curve.

    # Determine number of clusters
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
    plot(1:15, wss, type="b", xlab="Number of Clusters",
         ylab="Within groups sum of squares")

A robust version of K-means based on medoids can be invoked by using pam() instead of kmeans().

Hierarchical Agglomerative
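Returning briefly to the partitioning steps above, the following is a minimal end-to-end sketch rather than the author's own code. It assumes the four numeric columns of the built-in `iris` data as a stand-in for `mydata`, an arbitrary choice of `k = 3`, and the `cluster` package (shipped with standard R distributions) for `pam()`. The `nstart` argument and the fixed seed are additions meant to make the random starts of `kmeans()` less of a factor.

```r
# Illustrative data: numeric columns of the built-in iris set
# (an assumption; the original works on an unspecified `mydata`).
mydata <- scale(na.omit(iris[, 1:4]))

set.seed(42)  # k-means starts from random centers, so fix the seed

# Total within-cluster sum of squares for k = 1..max_k
max_k <- 10
wss <- numeric(max_k)
wss[1] <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))  # k = 1: total SS
for (k in 2:max_k) {
  wss[k] <- sum(kmeans(mydata, centers = k, nstart = 10)$withinss)
}

# Elbow plot: look for the point where the curve flattens
plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Fit k-means and its medoid-based counterpart with an assumed k = 3
k <- 3
fit_km  <- kmeans(mydata, centers = k, nstart = 10)
fit_pam <- cluster::pam(mydata, k = k)

# Cross-tabulate the two cluster assignments to see how far they agree
agreement <- table(kmeans = fit_km$cluster, pam = fit_pam$clustering)
print(agreement)
```

Cluster labels are arbitrary, so the cross-tabulation is read by looking for one dominant cell per row rather than by matching label numbers.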

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons

AirBnB New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel. This was the first recruiting competition on Kaggle with scripts enabled. AirBnB encouraged participants to prove their chops through their collaboration and code sharing in addition to their final models.

Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, a RandomForest, and an ExtraTreesClassifier.

The Basics

What was your background prior to entering this challenge?

I currently work as a postdoctoral researcher at the NeuroInformatics Laboratory, FBK in Trento, Italy.

How did you get started competing on Kaggle?

I first heard about Kaggle around three years ago, when a colleague showed me the website.

[Figure: Sandro's top 8 finishes]

What made you decide to enter this competition?

Let's Get Technical

What preprocessing and supervised learning methods did you use?

[Fig. 1: Feature Engineering]

How To Perform A Logistic Regression In R

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values.

Logistic regression implementation in R

R makes it very easy to fit a logistic regression model.

The dataset

We’ll be working on the Titanic dataset.

The data cleaning process

When working with a real dataset we need to take into account the fact that some data might be missing or corrupted, so we need to prepare the dataset for our analysis. Empty strings are read in as NA:

    training.data.raw <- read.csv('train.csv', header=TRUE, na.strings=c(""))

Now we need to check for missing values and see how many unique values there are for each variable, using the sapply() function, which applies the function passed as an argument to each column of the data frame. We then keep only the columns of interest, selected by position:

    data <- subset(training.data.raw, select=c(2,3,5,6,7,8,10,12))
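The missing-value and unique-value checks described above can be sketched with `sapply()` as follows. This is only an illustration: the small data frame `df` and its values are made up so the block runs without `train.csv`, and the column names are merely reminiscent of the Titanic data.

```r
# Small stand-in data frame (hypothetical values; the article reads train.csv)
df <- data.frame(
  Survived = c(0, 1, 1, 0, NA),
  Age      = c(22, NA, 26, 35, NA),
  Embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_counts <- sapply(df, function(x) sum(is.na(x)))

# Number of distinct values in each column (NA counts as one distinct value)
unique_counts <- sapply(df, function(x) length(unique(x)))

print(missing_counts)
print(unique_counts)
```

Columns with many missing values are candidates for dropping or imputing before a model is fitted.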

Variable Selection Procedures - The LASSO | Business Forecasting

The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X.

Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection.

This post features a toy example illustrating tactics in variable selection with the lasso. The post also discusses the issue of consistency: how we know, from a large-sample perspective, that we are homing in on the true set of predictors when we apply the LASSO.

My take is that a two-step approach is often best.

Toy Example

The following Table illustrates something of the power of the lasso.

Real World Examples
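To make the shrink-to-zero behaviour described earlier concrete, here is a small base-R sketch. It is not the post's toy example: the data are simulated, the penalty `lambda = 0.5` is an arbitrary assumption, and the helper names `soft_threshold` and `lasso_cd` are hypothetical. It also swaps in coordinate descent with soft-thresholding, a commonly used solver, in place of the quadratic programming formulation mentioned above. The objective minimized is (1/(2n))·||y − Xβ||² + λ·||β||₁.

```r
# Soft-thresholding operator: shrinks z toward zero by gamma, clipping at zero
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Lasso by cyclic coordinate descent (hypothetical helper, no intercept:
# assumes the columns of X are centered and y is centered)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- numeric(p)
  r <- y                                   # residual y - X %*% beta at beta = 0
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]            # remove predictor j's contribution
      zj <- sum(X[, j] * r) / n            # covariance with partial residual
      beta[j] <- soft_threshold(zj, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * beta[j]            # add the updated contribution back
    }
  }
  beta
}

# Simulated data: 10 candidate predictors, only the first 3 matter
set.seed(1)
n <- 200
p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
beta_true <- c(3, -2, 1.5, rep(0, p - 3))
y <- drop(X %*% beta_true + rnorm(n))
y <- y - mean(y)

beta_hat <- lasso_cd(X, y, lambda = 0.5)
print(round(beta_hat, 3))
print(which(beta_hat != 0))                # indices of the selected predictors
```

The surviving coefficients are shrunk toward zero relative to their true values, which is the price the penalty exacts for selection. In practice λ would be chosen by cross-validation rather than fixed by hand.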