Statistics

Seeing Theory. Seeing Theory is a project designed and created by Daniel Kunin with support from Brown University's Royce Fellowship Program.

The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations. Statistics is quickly becoming the most important and multi-disciplinary field of mathematics. According to the American Statistical Association, "statistician" is one of the top ten fastest-growing occupations and statistics is one of the fastest-growing bachelor's degrees. Statistical literacy is essential to our data-driven society. Yet, for all the increased importance of and demand for statistical competence, the pedagogical approaches in statistics have barely changed.

The Mathematics of Machine Learning.

In the last few months, I have had several people contact me about their enthusiasm for venturing into the world of data science and using Machine Learning (ML) techniques to probe statistical regularities and build impeccable data-driven products.

However, I have observed that some actually lack the necessary mathematical intuition and framework to get useful results. This is the main reason I decided to write this blog post. Recently, there has been an upsurge in the availability of many easy-to-use machine and deep learning packages such as scikit-learn, Weka, TensorFlow, R-caret, etc. Machine Learning theory is a field that intersects statistical, probabilistic, computer science and algorithmic aspects arising from learning iteratively from data and finding hidden insights which can be used to build intelligent applications.

Unlearning descriptive statistics. If you've ever used an arithmetic mean, a Pearson correlation or a standard deviation to describe a dataset, I'm writing this for you.

Better numbers exist to summarize location, association and spread: numbers that are easier to interpret and that don't act up with wonky data and outliers. Statistics professors tend to gloss over basic descriptive statistics because they want to spend as much time as possible on margins of error and t-tests and regression. Fair enough, but the result is that it's easier to find a machine learning expert than someone who can talk about numbers. Forget what you think you know about descriptives and let me give you a whirlwind tour of the real stuff. The average.

R for Data Science.

A Visual Introduction to Machine Learning. Finding better boundaries. Let's revisit the 73-m elevation boundary proposed previously to see how we can improve upon our intuition.

Clearly, this requires a different perspective. By transforming our visualization into a histogram, we can better see how frequently homes appear at each elevation. While the highest home in New York is at 73 m, the majority of them seem to have far lower elevations.

Teorías, hechos y mentes. We are living through a key moment in the economic development of our societies, and perhaps in the history of humanity itself: the creation of true Artificial Intelligence systems.

The advance of Big Data, the development and success of techniques such as Deep Learning, and the anecdotal examples that are already beginning to appear in our lives are only premonitions of what is coming: a true economic and social tsunami that will bring serious political upheaval.

Data Types 101.

Comparing machine learning classifiers based on their hyperplanes or decision boundaries - Data Scientist TJO in Tokyo. In the Japanese version of this blog, I've written a series of posts about how each kind of machine learning classifier draws its classification hyperplanes or decision boundaries.

So in this post I want to show you a summary of the series and how their hyperplanes or decision boundaries vary (translated from the Japanese version). It should be interesting and help you understand the nature of each classifier.

Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. Top 10 data mining algorithms in plain R. Regression Models for Data… by Brian Caffo.

Understanding p-values via simulations. As I mentioned in an earlier post, p-values in psychological research are often misunderstood.

Ask students (and academics!) what the definition of the p-value is and you will likely get many different responses. To jog your memory, the definition of the p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one you have observed, assuming the null is true. But, even with this definition in hand, many struggle to conceptualise what the p-value reflects.

Regression Modelling. "Linear Regression is used to predict or estimate the value of a response variable by modeling it against one or more explanatory variables.

The variables must be pairwise, continuous and are assumed to have a linear relationship between them. This technique is widely popular in predictive analysis." Assumptions of a Linear Regression: The residuals, calculated as the difference between actual and predicted values measured along the Y-axis, should follow a normal distribution (bell-shaped curve). No heteroscedasticity exists, that is, the residuals have constant variance.

Simple Linear Regression: A complete introduction with numeric example. Linear regression is a predictive modelling technique that aims to predict the value of an outcome variable based on one or more input predictor variables.

The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so we can use it to estimate the value of the response when the predictor values are known. Introduction. For this analysis, we will use the ‘cars’ dataset that comes with R by default. ‘cars’ is a standard built-in dataset, which makes it convenient to demonstrate linear regression in a simple and easy-to-understand fashion.
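As a rough illustration of what establishing such a linear relationship involves, here is a minimal Python sketch that fits an ordinary least squares line by hand. It is not the original article's R code. The data points are made-up values rather than rows of the ‘cars’ dataset, and the function names `fit_simple_linear_regression` and `predict` are hypothetical.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Least squares fit of y = intercept + slope * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of the predictor.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x_new):
    """Estimate the response for a known predictor value."""
    return intercept + slope * x_new


# Made-up illustrative data (assumption): the response is exactly 1 + 2 * predictor.
predictor = [1, 2, 3, 4, 5]
response = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(predictor, response)
estimate = predict(intercept, slope, 6)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, estimate at 6: {estimate:.2f}")
```

Because the illustrative data lie exactly on a line, the fit recovers that line; with real data the fitted line would only approximate the points, and the residuals would need checking against the assumptions listed earlier.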

Advanced Linear Regression: A Case study. It is possible to build multiple regression models for just one set of response and predictor variables.

When you are building models manually, it can be a herculean task to build even one valid, statistically significant regression model, especially when you are new to the data or the problem. It can be rather frustrating if you later find out that there is multicollinearity, that your model does not perform equally well when cross-validated on random samples, or that it does not have good prediction accuracy on test data, or worse. What if we had the flexibility to build all the choicest statistically valid models, see their prediction accuracy, cross-validate on random samples, compare all important diagnostic parameters in one place and finally pick the one that best suits our case? The details that follow will attempt to solve this. If you need to learn the fundamentals, the basics of regression modelling will offer more details on this subject.
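To make the idea of cross-validating candidate models on random samples concrete, here is a small Python sketch of k-fold cross-validation comparing two candidate models. It is not the article's own procedure, which is not shown in this excerpt. The data are made up, and all names (`k_fold_mse`, `fit_mean`, `fit_line` and so on) are hypothetical.

```python
import random
from statistics import mean


def fit_mean(x, y):
    """Baseline model: always predict the training mean of the response."""
    return mean(y)


def predict_mean(model, x_new):
    return model


def fit_line(x, y):
    """Least squares line; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict_line(model, x_new):
    intercept, slope = model
    return intercept + slope * x_new


def k_fold_mse(x, y, fit, predict, k=4, seed=0):
    """Average held-out mean squared error over k random folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        squared = [(y[i] - predict(model, x[i])) ** 2 for i in held_out]
        fold_errors.append(mean(squared))
    return mean(fold_errors)


# Made-up data (assumption): a linear trend plus small fixed perturbations.
x = list(range(12))
wiggle = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.3]
y = [1 + 2 * xi + w for xi, w in zip(x, wiggle)]

mean_mse = k_fold_mse(x, y, fit_mean, predict_mean)
line_mse = k_fold_mse(x, y, fit_line, predict_line)
print(f"mean-only model CV MSE: {mean_mse:.3f}")
print(f"linear model CV MSE:    {line_mse:.3f}")
```

Scoring every candidate with the same held-out error is what allows a like-for-like comparison; in practice other diagnostics would be compared alongside it before picking a model.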

Problem description. Ordinary Least Squares Regression explained visually. Why You Need to Study Statistics. "Hey Statistics" (Hey Soul Sister Parody?) VassarStats: Statistical Computation Web Site. The Dot Product and Cosine. Gaston Sanchez. Statistics Hell. The timeline of statistics. ‘Study the past if you would define the future’ - Confucius. ‘The further back you can look, the further forward you are likely to see’ – Churchill. ‘If history were taught in the form of stories it would never be forgotten’ – Kipling.

P-Values. El arte de programar en R: Un lenguaje para la estadística. Deep Learning in a Nutshell. 29 December 2014 Deep learning. Neural networks. Backpropagation. Introductory R Presentation. Math Explains Likely Long Shots, Miracles and Winning the Lottery. Adapted from The Improbability Principle: Why Coincidences, Miracles, and Rare Events Happen Every Day, by David J. Hand, by arrangement with Scientific American/Farrar, Straus and Giroux, LLC (North America), Transworld (UK), Ambo|Anthos (Holland), C.H. Big Data, Machine Learning, and the Social Sciences. Papers/volume15/delgado14a/delgado14a.pdf. A non-comprehensive list of awesome things other people did in 2014. stat545-ubc.github.io/index.html. Data science without statistics is possible, even desirable.

The purpose of this article is to clarify a few misconceptions about data and statistical science. I will start with a controversial statement: data science barely uses statistical science and techniques. The truth is actually more nuanced, as explained below. 1. Data science heavily uses new statistical science. Research that matters, results that make sense. A Brief Review of All Comic Books Teaching Statistics. Rasmus Bååth and Christian Robert. A two-hour online course on ggplot2 and Shiny. 0s.pdf. Statistics is losing ground to computer science. A geometric interpretation of the covariance matrix. Introduction.

iNZight for Data Analysis. Vasishthbroe.pdf. John Rauser keynote: "Statistics Without the Agonizing Pain". How do random forests work in layman's terms? Basic-Econometrics.pdf. Matrix_algebra.pdf. Collaborative Statistics. Statistics Using Technology. I hope you find this book useful in teaching statistics. Bayesian statistics: a comprehensive course. Dm-stat.pdf. Overfitting: Machine Learning Music Video. Neglected machine learning ideas. I am not an econometrician. A Web Journal about Machine Learning, Music, and other Mischief. Guy's Econometrics blog: XtransX to the minus one X transpose Y. The Analysis Factor — Statistical Consulting, Resources, and Statistics Workshops for Researchers in Psychology, Sociology, and other Social and Biological Sciences.

Ben Lambert. www.kevinsheppard.com/images/0/09/Python_introduction.pdf. iospress.metapress.com/content/l507114250630285/fulltext.pdf. Statistical Shortcomings in Standard Math Libraries (And How To Fix Them). Trey Causey - Getting Started in Data Science. 100+ Interesting Data Sets for Statistics. Statistics Blogs @ StatsBlogs.com.

The Birthday Simulation. Introducing Probability. homepages.inf.ed.ac.uk/vlavrenk/iaml.html. Machine Learning A Cappella - Overfitting Thriller! Spurious Correlations. Twitter. Eight (No, Nine!) Problems With Big Data. Young Researchers in Biostatistics. Distance Education § Harvard University Extension School. nisla05.niss.org/copss/past-present-future-copss.pdf. ¿Qué es eso de crecer exponencialmente? StatsTeachR. Statistics Lessons. 4.2 Model Selection Viewed As Search.