
A Taxonomy of Data Science

Posted: September 25th, 2010 | Author: Hilary Mason | Filed under: Philosophy of Data | Tags: data, data science, osemn, taxonomy | 31 Comments

Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science? We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to ‘look at data’. Different data scientists have different levels of expertise with each of these 5 areas, but ideally a data scientist should be at home with them all. We describe each one of these steps briefly below: Obtain: pointing and clicking does not scale. Deep thoughts: Our next post addresses how one goes about learning these skills, that is: “what does a data science curriculum look like?”

Significantly misleading

Author: Mark Kelly

Mark Twain, with characteristic panache, said ‘…I am dead to adverbs, they cannot excite me’. Stephen King agrees, saying ‘The road to hell is paved with adverbs’. The idea, of course, is that if you are using an adverb you have chosen the wrong verb. It is stronger to say ‘He shouted’ than it is to say ‘He said loudly’. What are we to make, then, of the ubiquitous ‘statistically significantly related’? ‘Statistically significant’ is a tremendously ugly phrase, but unfortunately that is the least of its shortcomings. Imagine if an environmentalist said that oil contamination was detectable in a sample of water from a protected coral reef. What we mean by a ‘statistically significant’ difference is that the difference is ‘unlikely to be zero’. ‘Statistically discernible’ is still 50% adverb, however.

Weak statistical standards implicated in scientific irreproducibility

The plague of non-reproducibility in science may be mostly due to scientists’ use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University in College Station. Johnson compared the strength of two types of tests: frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study. The strength of the results given by these two types of tests had not been compared before, because they ask slightly different types of questions. So Johnson developed a method that makes the results given by the tests (the P value in the frequentist paradigm, and the Bayes factor in the Bayesian paradigm) directly comparable. Johnson then used these uniformly most powerful tests to compare P values to Bayes factors. Indeed, as many as 17–25% of such findings are probably false, Johnson calculates.

QQ Plots for NY's Ozone Pollution Data

Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution. I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

[Figure: Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!]

What is a Quantile-Quantile Plot?

A quantile-quantile plot, or Q-Q plot, is a plot of the sorted quantiles of one data set against the sorted quantiles of another data set. The sample sizes of the 2 data sets do not have to be equal. The quantiles of the 2 data sets can be observed or theoretical.

Constructing Quantile-Quantile Plots to Check Goodness of Fit

The following steps will build a Q-Q plot to check how well a data set fits a particular theoretical distribution.
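As a rough illustration of the construction described above, here is a minimal Python sketch that pairs the sorted observations with the matching quantiles of a fitted normal distribution. The original post works in R; this version, the function name `qq_points`, the made-up sample, and the plotting positions (i - 0.5)/n are all assumptions for illustration, not the author's steps.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    observed = sorted(sample)                      # observed quantiles
    n = len(observed)
    # Fit a normal distribution using the sample mean and standard deviation.
    fitted = NormalDist(mean(observed), stdev(observed))
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [fitted.inv_cdf(p) for p in probs]
    return list(zip(theoretical, observed))

# Made-up sample for illustration only (not the airquality Ozone data).
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9, 15.3]
points = qq_points(sample)
for theo, obs in points:
    print(f"theoretical={theo:7.2f}  observed={obs:5.1f}")
```

If the sample followed the fitted distribution closely, the printed pairs would fall near the line where the theoretical and observed values are equal; systematic departures at the ends suggest heavier or lighter tails.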

Instrumental Variables

Jan 10, 2014

Instrumental variables are an incredibly powerful tool for dealing with unobserved heterogeneity within the context of regression, but the language used to define them is mind bending. Typically, you hear something along the lines of “an instrumental variable is a variable that is correlated with x but uncorrelated with the outcome except through x.” I like math stats (when I am not getting a grade for it at least!) I turned to Google and did several searches and the only simple simulation that I could find was done using Stata.

Overview

Suppose that you have a continuous variable […] with the known mean response function […] and further that […] and […] are correlated with each other. But often we don't observe […] in our data. […] could be any number of things, such as treatment practices at a hospital or unmeasured differences between patients, but it is in the direct causal path of […] and you don't know it. […] but in this case […] where […] is white noise centered on zero. […] and […]

Simulations

    z <- rnorm(1000)
    x <- xStar + z
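To make the idea concrete, here is a small simulation sketch in Python rather than the post's R. Only `x = xStar + z` mirrors the surviving fragment; the names `c` (unobserved confounder) and `y`, the coefficients, and the sample size are hypothetical choices. It compares the naive regression slope with the simple instrumental-variable ratio cov(z, y) / cov(z, x).

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]        # unobserved confounder (hypothetical)
    x_star = [ci + rng.gauss(0, 1) for ci in c]    # part of x that is correlated with c
    z = [rng.gauss(0, 1) for _ in range(n)]        # instrument, independent of c
    x = [xs + zi for xs, zi in zip(x_star, z)]     # mirrors x <- xStar + z
    # Assumed outcome model: the true slope on x is 1, and c also affects y.
    y = [1 + xi + ci + rng.gauss(0, 1) for xi, ci in zip(x, c)]
    ols = cov(x, y) / cov(x, x)   # regression of y on x alone, with c omitted
    iv = cov(z, y) / cov(z, x)    # instrumental-variable (ratio) estimate
    return ols, iv

ols_slope, iv_slope = simulate()
print(f"OLS slope: {ols_slope:.3f}   IV slope: {iv_slope:.3f}")
```

Under these assumptions the regression that omits `c` overstates the slope, while the ratio estimate lands close to the true value of 1, because `z` moves `x` without being related to `c`.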

Taleb - Deviation

The notion of standard deviation has confused hordes of scientists; it is time to retire it from common use and replace it with the more effective one of mean deviation. Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems. There is no scientific reason to use it in statistical investigations in the age of the computer, as it does more harm than good, particularly with the growing class of people in social science mechanistically applying statistical tools to scientific problems. Say someone just asked you to measure the "average daily variations" for the temperature of your town (or for the stock price of a company, or the blood pressure of your uncle) over the past five days. The five changes are: (-23, 7, -3, 20, -1). How do you do it? Do you take every observation: square it, average the total, then take the square root? It all comes from bad terminology for something non-intuitive.
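The two candidate answers for the five changes can be worked out in a few lines of Python. This is a sketch that follows the wording of the passage literally: it averages the absolute changes, and separately squares, averages, and takes the square root, without subtracting a mean first. The variable names are illustrative.

```python
from math import sqrt

changes = [-23, 7, -3, 20, -1]

# Mean absolute change: drop the signs, then average.
mean_abs = sum(abs(c) for c in changes) / len(changes)   # 10.8

# "Square it, average the total, then take the square root."
root_mean_square = sqrt(sum(c * c for c in changes) / len(changes))

print(f"mean absolute change: {mean_abs:.2f}")
print(f"root mean square:     {root_mean_square:.2f}")
```

The squared version comes out larger because squaring gives the two big swings (-23 and 20) more weight than the small ones.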

Absolute Deviation Around the Median

Median Absolute Deviation (MAD), or Absolute Deviation Around the Median as stated in the title, is a robust measure of statistical dispersion. Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions. Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. Essentially, the breakdown point of an estimator (median, mean, variance, etc.) is the proportion or number of arbitrarily small or large extreme values that must be introduced into a sample to cause the estimator to yield an arbitrarily bad result. For example: if you have the ordered set [2, 6, 6, 12, 17, 25, 32], the median is 12 and the mean is 14.29. To calculate the MAD, we find the median of absolute deviations from the median. Using the same set from earlier: the absolute deviations from the median are [10, 6, 6, 0, 5, 13, 20], whose median is 6, and multiplying by the usual consistency constant 1.4826 gives 8.8956. We now have our MAD (8.8956) to use in our predetermined threshold.
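The calculation for the example set can be reproduced with a short Python sketch. The function name `mad` and the default scale of 1.4826 (the usual constant that makes MAD comparable to a standard deviation for normal data) are assumptions here; the 8.8956 quoted above is consistent with that scaling.

```python
from statistics import median

def mad(values, scale=1.4826):
    """Scaled median absolute deviation around the median."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    return scale * median(deviations)

data = [2, 6, 6, 12, 17, 25, 32]
raw_mad = mad(data, scale=1)        # median of |x - 12| over the set
scaled_mad = mad(data)              # raw MAD times 1.4826

# Replace the largest value with an extreme outlier: the raw MAD does not move.
with_outlier = [2, 6, 6, 12, 17, 25, 3200]
raw_mad_outlier = mad(with_outlier, scale=1)

print(raw_mad, round(scaled_mad, 4), raw_mad_outlier)
```

The last line of the sketch illustrates the robustness claim: swapping 32 for 3200 leaves both the median and the raw MAD unchanged, whereas the mean and standard deviation would shift substantially.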

Use standard deviation (not mad about MAD)

Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models. Let’s suppose we have 2 boxes of 10 lottery tickets: all tickets were purchased for $1 each for the same game in an identical fashion at the same time. Now, since all tickets are identical, if we are making a mere point-prediction (a single number value estimate for each ticket instead of a detailed posterior distribution), then there is an optimal prediction that is a single number V. Suppose we use mean absolute deviation as our measure of model quality.
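A small numeric sketch can show what "prefers models that get totals correct" means for a single-number prediction V. The payoffs below are hypothetical (19 tickets worth $0 and one worth $5) and are not taken from the article; the grid search and the names are likewise illustrative.

```python
# Hypothetical outcomes for 20 identical tickets: one pays $5, the rest pay $0.
payoffs = [0.0] * 19 + [5.0]

def squared_loss(v):
    return sum((p - v) ** 2 for p in payoffs)

def absolute_loss(v):
    return sum(abs(p - v) for p in payoffs)

# Candidate single-number predictions V from $0.00 to $5.00 in 1-cent steps.
candidates = [i / 100 for i in range(0, 501)]

best_squared = min(candidates, key=squared_loss)    # lands on the mean payoff
best_absolute = min(candidates, key=absolute_loss)  # lands on the median payoff

total_actual = sum(payoffs)
print("squared-error V: ", best_squared, "predicted total:", best_squared * len(payoffs))
print("absolute-error V:", best_absolute, "predicted total:", best_absolute * len(payoffs))
print("actual total:    ", total_actual)
```

With these made-up payoffs, the squared-error criterion picks the mean, so the predicted total matches the actual total, while the absolute-error criterion picks the median and predicts a total of zero.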