
Statistical Data Mining Tutorials

Advertisement: In 2006 I joined Google. We are growing a Google Pittsburgh office on CMU's campus. We are hiring creative computer scientists who love programming, and Machine Learning is one of the focus areas of the office. We're also currently accepting resumes for Fall 2008 internships. If you might be interested, feel welcome to send me email.

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. I hope they're useful (and please let me know if they are, or if you have suggestions or error-corrections). A minimal sketch of the case-based idea follows.
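
To make "case-based (aka non-parametric) learning" concrete, here is a minimal k-nearest-neighbour sketch in plain Python. It is our own illustration, not code from the tutorials; knn_predict and the toy data are made up for this example.

    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        # Rank the stored cases by Euclidean distance to the query point.
        by_dist = sorted(train, key=lambda case: math.dist(case[0], query))
        # Majority vote among the k nearest cases decides the label:
        # no model is fitted, prediction comes straight from stored cases.
        votes = Counter(label for _, label in by_dist[:k])
        return votes.most_common(1)[0][0]

    # Toy "case base": (feature vector, class label) pairs.
    train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
             ((4.0, 4.2), "b"), ((3.8, 4.0), "b")]
    print(knn_predict(train, (1.1, 0.9)))  # -> a
    print(knn_predict(train, (4.1, 4.1)))  # -> b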

Central limit theorem

The central limit theorem has a number of variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions, given that they comply with certain conditions. In more general probability theory, a central limit theorem is any of a set of weak-convergence theorems. They all express the fact that a sum of many independent and identically distributed (i.i.d.) random variables, or alternatively, random variables with specific types of dependence, will tend to be distributed according to one of a small set of attractor distributions. When the variance of the i.i.d. variables is finite, the attractor distribution is the normal distribution.

The classical (Lindeberg–Lévy) CLT makes this precise: suppose X₁, X₂, … is a sequence of i.i.d. random variables with E[Xᵢ] = μ and Var[Xᵢ] = σ² < ∞. Then, as n → ∞, the standardized sample mean √n(X̄ₙ − μ)/σ converges in distribution to a standard normal random variable; equivalently, P(√n(X̄ₙ − μ)/σ ≤ x) → Φ(x), where Φ(x) is the standard normal cdf evaluated at x. Variants such as the Lyapunov CLT and the Lindeberg CLT extend the result to independent but not identically distributed random variables that satisfy suitable moment conditions.
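
As an illustrative check of the Lindeberg–Lévy statement (our own sketch, not from the source), the Python snippet below standardizes sample means of a skewed Exponential(1) distribution and compares their empirical cdf to Φ; the variable names and the choice of distribution are assumptions made for this example.

    import math
    import numpy as np

    def phi(x):
        # Standard normal cdf, Phi(x), via the error function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    rng = np.random.default_rng(0)
    n, trials = 500, 5000   # observations per mean, number of means
    mu, sigma = 1.0, 1.0    # Exponential(1) has mean 1 and variance 1

    # Draw i.i.d. exponential samples: heavily skewed, clearly non-normal.
    x = rng.exponential(scale=1.0, size=(trials, n))

    # Standardized sample means: sqrt(n) * (X_bar - mu) / sigma.
    z = math.sqrt(n) * (x.mean(axis=1) - mu) / sigma

    # Their empirical cdf should be close to Phi at every point.
    for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"P(Z <= {t:+.0f}): empirical {np.mean(z <= t):.3f}, Phi {phi(t):.3f}")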

The distribution of first digits, according to Benford's law.

Benford's law

Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit. (A companion figure plots the frequency of the first significant digit of physical constants against Benford's law.)

Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 appears as the first digit less than 5% of the time.

Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution. The graph here shows Benford's Law for base 10. It is named after physicist Frank Benford, who stated it in 1938,[1] although it had been previously stated by Simon Newcomb in 1881.[2]

Mathematical statement: a leading digit d (d ∈ {1, …, 9}) occurs with probability P(d) = log₁₀(1 + 1/d); for example, P(1) = log₁₀ 2 ≈ 30.1% and P(9) = log₁₀(10/9) ≈ 4.6%.
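
As a quick sanity check (our own sketch, not from the article), the Python snippet below tallies the leading digits of the first 1000 powers of 2, a sequence known to conform to Benford's law, and compares the observed frequencies with log₁₀(1 + 1/d).

    import math
    from collections import Counter

    # Leading digit of each power of 2, taken from its decimal string.
    counts = Counter(int(str(2 ** k)[0]) for k in range(1, 1001))

    for d in range(1, 10):
        observed = counts[d] / 1000
        predicted = math.log10(1 + 1 / d)
        print(f"digit {d}: observed {observed:.3f}, Benford {predicted:.3f}")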

Magician-turned-mathematician uncovers bias in coin flipping

By Esther Landhuis. Persi Diaconis has spent much of his life turning scams inside out. In 1962, the then 17-year-old sought to stymie a Caribbean casino that was allegedly using shaved dice to boost house odds in games of chance. In the mid-1970s, the upstart statistician exposed some key problems in ESP research and debunked a handful of famed psychics. Now a Stanford professor of mathematics and statistics, Diaconis has turned his attention toward simpler phenomena: determining whether coin flipping is random. Could a simple coin toss -- used routinely to decide which team gets the ball, for instance -- actually be rigged? Diaconis set out to test what he thought was obvious -- that coin tosses, the currency of fair choices, couldn't be biased.

