Books

Cluster Analysis

R has an amazing variety of functions for cluster analysis.

In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best solutions for the problem of determining the number of clusters to extract, several approaches are given below.

Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.

    # Prepare Data
    mydata <- na.omit(mydata) # listwise deletion of missing
    mydata <- scale(mydata) # standardize variables

Partitioning

K-means clustering is the most popular partitioning method.

    # Determine number of clusters
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
    plot(1:15, wss, type="b", xlab="Number of Clusters",
         ylab="Within groups sum of squares")
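As a rough illustration of the partitioning and hierarchical agglomerative approaches named above, the following sketch runs both on a small built-in data set. It assumes the numeric columns of `iris` stand in for `mydata`, and the choice of three clusters is an assumption made for illustration, not a recommendation from the original text.

```r
# Assumption: the four numeric iris measurements stand in for `mydata`.
mydata <- scale(na.omit(iris[, 1:4]))

set.seed(42)  # k-means starts from random centers, so fix the seed

# Total within-cluster sum of squares for k = 1..10 (the "elbow" curve)
wss <- sapply(1:10, function(k) {
  kmeans(mydata, centers = k, nstart = 20)$tot.withinss
})

# Fit a 3-cluster k-means solution (k = 3 is an illustrative assumption)
fit <- kmeans(mydata, centers = 3, nstart = 20)

# Cluster means on the standardized scale
aggregate(as.data.frame(mydata), by = list(cluster = fit$cluster), FUN = mean)

# Hierarchical agglomerative alternative: Ward linkage on Euclidean distances
d <- dist(mydata, method = "euclidean")
hfit <- hclust(d, method = "ward.D2")
groups <- cutree(hfit, k = 3)

# Cross-tabulate the two solutions to see how far they agree
table(kmeans = fit$cluster, hclust = groups)
```

Using `nstart = 20` reruns k-means from several random starts and keeps the best solution, which makes the within-cluster sums of squares less sensitive to an unlucky initialization.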

Airbnb New User Bookings, Winner’s Interview: 3rd place: Sandro Vega Pons

AirBnB New User Bookings was a popular recruiting competition that challenged Kagglers to predict the first country where a new user would book travel.

This was the first recruiting competition on Kaggle with scripts enabled. AirBnB encouraged participants to prove their chops through collaboration and code sharing in addition to their final models. Sandro Vega Pons took 3rd place, ahead of 1,462 other competitors, using an ensemble of GradientBoosting, MLP, a RandomForest, and an ExtraTreesClassifier. In this blog, Sandro explains his approach and shares key scripts that illustrate important aspects of his analysis.

The Basics

What was your background prior to entering this challenge?

I currently work as a postdoctoral researcher at the NeuroInformatics Laboratory, FBK in Trento, Italy.

The NASDAQ Stock Market

How To Perform A Logistic Regression In R

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable.

The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values. In the simplest case scenario y is binary, meaning that it can assume either the value 1 or 0.

Building a Smarter Application

Why does the Lasso provide Variable Selection?
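The binary case of logistic regression described above can be sketched in base R with `glm`. This is a minimal example under stated assumptions: the built-in `mtcars` data stands in for the reader's data, `am` plays the role of the binary y, `wt` and `hp` are arbitrary continuous predictors, and the 0.5 cutoff is an illustrative choice.

```r
# Assumption: mtcars (built into R) stands in for the reader's data;
# `am` (0 = automatic, 1 = manual) plays the role of the binary y.
model <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# Estimated coefficients on the log-odds scale
summary(model)$coefficients

# Predicted probabilities P(y = 1 | x) for the training rows
p_hat <- predict(model, type = "response")

# Turn probabilities into class labels with a 0.5 cutoff
# (the cutoff is an arbitrary choice for illustration)
y_hat <- ifelse(p_hat > 0.5, 1, 0)

# In-sample accuracy; a held-out set would give a fairer estimate
mean(y_hat == mtcars$am)
```

Calling `predict` with `type = "response"` returns probabilities rather than log-odds, which is usually what is wanted when predicting y from x.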

Variable Selection Procedures - The LASSO

The LASSO (Least Absolute Shrinkage and Selection Operator) is a method of automatic variable selection which can be used to select predictors X* of a target variable Y from a larger set of potential or candidate predictors X.

Developed in 1996 by Tibshirani, the LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. In doing so, the LASSO can drive the coefficients of irrelevant variables to zero, thus performing automatic variable selection.
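To make the "coefficients driven to zero" behaviour concrete, here is a small base-R sketch. It uses cyclic coordinate descent with soft-thresholding, which is a different solver from the quadratic programming formulation mentioned above and is not any package's implementation. The function names `soft_threshold` and `lasso_cd`, the synthetic data, and the value of `lambda` are all assumptions made for illustration.

```r
# Soft-thresholding operator used by each coordinate update
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Hypothetical helper: minimizes
#   (1 / (2n)) * sum((y - X %*% b)^2) + lambda * sum(abs(b))
# Assumes the columns of X are standardized and y is centered (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Partial residual with predictor j left out of the fit
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

set.seed(1)
n <- 200
X <- scale(matrix(rnorm(n * 5), n, 5))
# Synthetic data: only the first two predictors carry signal
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
y <- y - mean(y)

beta_hat <- lasso_cd(X, y, lambda = 0.5)
round(beta_hat, 3)
```

Larger values of `lambda` shrink the surviving coefficients further and zero out more of them; in practice the tuning parameter is usually chosen by cross-validation rather than fixed by hand as it is here.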

This post features a toy example illustrating tactics in variable selection with the lasso. The post also discusses the issue of consistency – how we know from a large-sample perspective that we are homing in on the true set of predictors when we apply the LASSO. My take is that a two-step approach is often best.

Toy Example

The following table illustrates something of the power of the lasso.

Real World Examples

Shelter Animal Outcomes

P hacking

Audience Segmentation - Giving Clicks a Personality

Times have changed.

My uncle is in his seventies, uses Facebook more than I do and is working as an extra in films and television commercials. My mother has retired and was discussing her modem with me last week. My other-half's son communicates in some kind of complex code through instant messenger status updates. A dinner conversation between four can extend to thousands through the medium of a social network. Your online presence and privacy (even if you've never used the internet) tends to depend heavily on the privacy stance and sharing tendencies of those around you. Just ten years ago, things were different. Now, however, folks on the internet are a lot more assured in their expectations from online experiences.

Attribution is becoming a much more difficult craft with the advent of varying devices and sharing tools, so it is ever more important to treat users as an interactive audience rather than as numbers.

Identify the age and gender of your audience.

High speed trading swimming - Marginal REVOLUTION

Next year the innovative swimming suits that are causing world records to fall at rapid pace will be banned.

Michael Mandel wonders if this is the beginning of the counterrevolution against technological progress and Tyler argues “essentially on innovation we’re seeing a flipping of the burden of proof and I don’t think it is possible to easily fine-tune that flipping in a way to capture good innovations and rule out bad ones.” Believe it or not, Mandel really was talking about swimsuits. Tyler, however, was talking about high speed trading but is there much difference between the two? I don’t think so. High-tech swimming suits and trading systems are primarily about distribution not efficiency.

One difference between swimsuits and trading systems is that the former are regulated by FINA, the federation that administers international competition in aquatic sports. But would exchange regulation go far enough?

Amazon