
Stats


Cross Validated. Kaggle. DrivenData. Randomise Me. Mirador. Mirador is a tool for visual exploration of complex datasets.

Mirador

It enables users to discover correlation patterns and derive new hypotheses from the data.

Download 1.3 (8 December 2014): Windows, Mac OS X. Instructions: download the file appropriate for your operating system.

About: Mirador is an open source project released under the GNU General Public License v2.

Further reading:
Ebola prognosis prediction (computational methods for patient prognosis based on available clinical data), June 9th, 2015.
Ebola data release (de-identified clinical data from Ebola patients treated at the Kenema Government Hospital in Sierra Leone between May and June of 2014), February 26th, 2015.
Awards from the Department of Health and Human Services (Mirador received the third place, innovation and scientific excellence awards in the HHS VizRisk challenge), January 5th, 2015.
Winning entries in the Mirador Data Competition (read about the winning correlations submitted by Mirador users), December 1st, 2014.

Automatic Statistician. Datasharing. Apache Arrow. BigParser - Search across all your spreadsheets smartly via your mobile. Dataproofer: A proofreader for your data. Statcheck. Comma Chameleon. Engaging students in learning statistics using The Islands. Three Problems and a Solution: Modern teaching methods for statistics have gone beyond the mathematical calculation of trivial problems.

Engaging students in learning statistics using The Islands

Computers can enable large studies, bringing reality to the subject, but this is not without its own problems. There are many reasons for students to learn statistics by running their own projects, following the complete statistical enquiry process: posing a problem, planning the data collection, collecting and cleaning the data, analysing the data, and drawing conclusions that relate back to the original problem. Individual projects can be both time-consuming and risky, as the quality of the report, and the resultant grade, can depend on the quality of the data collected, which may be beyond the student's control.

The Statistical Enquiry Cycle, which underpins the NZ statistics curriculum. The problem here is obvious. I recently ran an exciting workshop for teachers on using The Islands. DrawMyData. E/R Assistant. Overview of statistics. Putting the methods you use into context: it may come as a surprise, but the way you were probably taught statistics during your undergraduate years is not the way statistics is done.

Overview of statistics

There are a number of different ways of thinking about and doing statistics. Glossary of statistical terms. A Taxonomy of Data Science. Posted: September 25th, 2010 | Author: Hilary Mason | Filed under: Philosophy of Data | Tags: data, data science, osemn, taxonomy. Both within the academy and within tech startups, we've been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or, more succinctly: what is data science? We've variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to 'look at data'. All of these are partially true, so we thought it would be useful to propose one possible taxonomy, which we call the Snice* taxonomy, of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).

Timeline of Statistics. Lessons in writing for the general public with the editor of Significance. Julian Champkin, the editor of Significance magazine, gave an entertaining talk on how statisticians can write about their specialised field in a simple and interesting way.

Lessons in writing for the general public with the editor of Significance

He began by quoting what Sir Mark Walport had said the previous day on the importance of communicating what statistics mean to wider society. He stressed the usefulness of considering your audience when you are writing an article. Especially when you are explaining a highly technical concept, you need to think about how you would explain it to your granny. Nine secrets you should have been taught as part of your undergrad stats degree. Based on this post that I read recently, I decided to put together a list of secrets that you should have learned during an undergraduate statistics degree, but probably didn’t.

Nine secrets you should have been taught as part of your undergrad stats degree

So, in no particular order. 1. Bayesian statistics. It seems incredible that stats degrees tend to have optional Bayesian modules; Bayesian statistics shouldn't be optional. The benefits are obvious. 2.

Stat Articles/Blogs

Stats Books (Including R). R. Data organization. My collaborators sometimes ask me, “In what form would you like the data?”

Data organization

My response is always, “In its current form!” If the data need to be reformatted, it's much better for me to write a script than for them to do a bunch of cut-and-paste. I'm a strong proponent of data analysts being able to handle any data files they might receive. But in many cases, I have to spend a lot of time writing scripts to rearrange the layout of the data. And how would you like your data analysts to spend their time? Modes, Medians and Means: A Unifying Perspective. Introduction / Warning: Any traditional introductory statistics course will teach students the definitions of modes, medians and means.

Modes, Medians and Means: A Unifying Perspective

But, because introductory courses can't assume that students have much mathematical maturity, the close relationship between these three summary statistics can't be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.  To do so, I'll need to introduce one non-standard definition that may trouble some readers.
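A minimal R sketch of the standard unification the post's title points at (the toy data and the brute-force grid search are my illustrative assumptions, not code from the post): the mode, median and mean each minimize the sum of |x - s|^p over candidate summaries s, for p = 0, 1 and 2 respectively.

x <- c(1, 2, 2, 3, 10)

# sum of |x - s|^p; for p = 0 we count the points not matched exactly
loss <- function(s, p) {
  if (p == 0) sum(abs(x - s) > 1e-8) else sum(abs(x - s)^p)
}

# brute-force search over a fine grid of candidate summaries
s_grid <- seq(min(x), max(x), by = 0.01)
minimizer <- function(p) s_grid[which.min(sapply(s_grid, loss, p = p))]

minimizer(0)  # 2   -> the mode of x
minimizer(1)  # 2   -> the median of x
minimizer(2)  # 3.6 -> the mean of x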

Sufficient statistics misunderstandings

Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics [1]. If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn't any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)
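A minimal R sketch of that claim (the simulated data and parameter values are illustrative): the normal log-likelihood computed from the full data agrees exactly with the one computed from just n, the sample mean and the biased sample variance.

set.seed(1)
x <- rnorm(50, mean = 3, sd = 2)

# log-likelihood at (mu, sigma) using every observation
ll_full <- function(mu, sigma) sum(dnorm(x, mu, sigma, log = TRUE))

# the same quantity using only n, mean(x) and the biased variance
n <- length(x); m <- mean(x); v <- mean((x - m)^2)
ll_suff <- function(mu, sigma) {
  -n / 2 * log(2 * pi * sigma^2) - n * (v + (m - mu)^2) / (2 * sigma^2)
}

ll_full(3, 2)  # identical values: the pair (mean, variance)
ll_suff(3, 2)  # carries all the information about (mu, sigma)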

Absolute Deviation Around the Median

Median Absolute Deviation (MAD), or Absolute Deviation Around the Median as stated in the title, is a robust measure of dispersion. Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions. Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. This robustness is well illustrated by the median's breakdown point (Donoho & Huber, 1983).
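A quick R illustration of that robustness (the toy numbers are mine, not from the post); note that base R's mad() rescales by 1.4826 so it is comparable to the standard deviation under normality.

x <- c(2, 3, 5, 6, 9)
sd(x)        # 2.74
mad(x)       # 2.97: a similar story on well-behaved data

x_out <- c(x, 1000)   # one wild outlier
sd(x_out)    # ~406: the standard deviation explodes
mad(x_out)   # ~4.4: MAD barely moves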

Taleb - Deviation

The notion of standard deviation has confused hordes of scientists; it is time to retire it from common use and replace it with the more effective one of mean deviation. Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems. There is no scientific reason to use it in statistical investigations in the age of the computer, as it does more harm than good, particularly with the growing class of people in social science mechanistically applying statistical tools to scientific problems. Say someone just asked you to measure the "average daily variations" for the temperature of your town (or for the stock price of a company, or the blood pressure of your uncle) over the past five days.

The five changes are: (-23, 7, -3, 20, -1).
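In R, the arithmetic of Taleb's example looks like this (taking the deviations around zero, as the framing of daily changes suggests):

changes <- c(-23, 7, -3, 20, -1)

mean(abs(changes))     # 10.8 : the mean absolute deviation Taleb advocates
sqrt(mean(changes^2))  # 14.06: the root-mean-square "standard deviation"
sd(changes)            # 15.72: R's sd(), centered and divided by n - 1

The gap between 10.8 and 14.06 is Taleb's point: the squaring step weights the single large move of -23 much more heavily than the small ones.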

Use standard deviation (not mad about MAD). Nassim Nicholas Taleb recently wrote an article advocating the abandonment of the use of standard deviation and the use of mean absolute deviation instead. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models.

Standard deviation vs Standard error. I often get asked (i.e. more than two times) by colleagues whether they should plot/use the standard deviation or the standard error; here is a small post trying to clarify the meaning of these two metrics and when to use them, with an R code example. Standard deviation: standard deviation is a measure of the dispersion of the data around the mean.

set.seed(20151204)
# generate some random data
x <- rnorm(10)
# compute the standard deviation
sd(x)
# 1.144105

Standard Error. The standard error is an estimate of the standard deviation of a statistic. This lesson shows how to compute the standard error based on sample data. The standard error is important because it is used to compute other measures, like confidence intervals and margins of error.
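The standard error of the mean follows directly from the snippet above: divide the standard deviation by the square root of the sample size.

# continuing from the x generated above
sd(x) / sqrt(length(x))
# 0.3618: the estimated standard deviation of mean(x)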

Notation: the following notation is helpful when we talk about the standard deviation and the standard error. Confidence and Prediction Intervals. Most people use them almost synonymously, but there is one major difference: a confidence interval expresses uncertainty about a population parameter (such as the mean), whereas a prediction interval gives a range in which a single future observation is expected to fall. Difference between prediction intervals and confidence intervals. Prediction intervals and confidence intervals are not the same thing. How do you think about the values in a confidence interval? When Discussing Confidence Level With Others… CIs: informal, traditional, bootstrap.
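A minimal R sketch of the distinction noted above (the simulated regression data are illustrative): at the same point, the prediction interval is wider than the confidence interval because it must also cover the noise in a single new observation.

set.seed(1)
d <- data.frame(x = 1:30)
d$y <- 2 + 0.5 * d$x + rnorm(30)

fit <- lm(y ~ x, data = d)
new <- data.frame(x = 15)
predict(fit, new, interval = "confidence")  # range for the mean response at x = 15
predict(fit, new, interval = "prediction")  # wider: range for one future observation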

Confidence intervals are needed because there is variation in the world. Nearly all natural, human or technological processes result in outputs which vary to a greater or lesser extent. Sig P-Vals and Overlapping CIs. Binomial CIs. The problem with p values: how significant are they, really? For researchers there's a lot that turns on the p value, the number used to determine whether a result is statistically significant. The current consensus is that if p is less than .05, a study has reached the holy grail of being statistically significant, and therefore likely to be published.

Over .05 and it's usually back to the drawing board. But today, Texas A&M University professor Valen Johnson, writing in the prestigious journal Proceedings of the National Academy of Sciences, argues that p less than .05 is far too weak a standard. Using .05 is, he contends, a key reason why false claims are published and many published results fail to replicate. Cosines and correlation. Degrees of Freedom Tutorial. Transforming data with zeros. Log Transformations for Skewed and Wide Distributions. Don't do the Wilcoxon. The Wilcoxon test is a nonparametric rank-based test for comparing two groups (a minimal call sketch follows this list of links). Four types of errors. 7 ways to separate errors from statistics.
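As promised above, a minimal sketch of calling the Wilcoxon rank-sum test in R on two illustrative simulated samples:

set.seed(1)
a <- rnorm(20)
b <- rnorm(20, mean = 1)
wilcox.test(a, b)  # rank-based test that the two groups differ in location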

Forecasting

Regression. Survival Analysis. ANOVA. Bayesian. Surveys. Ordinal Chi-Square. Chart of distribution relationships. Univariate Distribution Relationship Chart. Distribution Families. What distribution does my data have? Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Prob and Stats Cookbook. Probability Cheatsheet. QQ plots. QQ Plots for NY's Ozone Pollution Data. Overview: Epi Measurements. Absolute vs relative risk – making sense of media stories.

Some ideas on communicating risks to the general public. When can odds ratios mislead? How to explain screening test outcomes. Value of re-analysis. Multiple testing. Cross-Validation: Why every statistician should know. AIC & BIC vs. Crossvalidation. Instrumental Variables. The Quartz Guide To Bad Data. Warning Signs in Experimental Design. Most published research results are false. Is epidemiology 90% wrong? 14 to 40 percent of medical research are false positives. Weak statistical standards implicated in scientific irreproducibility.

New Truths That Only One Can See. Putting a Value to ‘Real’ in Medical Research. Still Not Significant. Significantly misleading. Scientific method: Statistical errors. Worry about correctness and repeatability, not p-values. Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values. P-hacker Shiny App. Why Not to Trust Statistics. Our nine-point guide to spotting a dodgy statistic.