
A Taxonomy of Data Science

Posted: September 25th, 2010 | Author: Hilary Mason | Filed under: Philosophy of Data | Tags: data, data science, osemn, taxonomy | 31 Comments

Both within the academy and within tech startups, we've been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or, more succinctly: what is data science? We've variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to 'look at data'. Different data scientists have different levels of expertise with each of these five areas (Obtain, Scrub, Explore, Model, and iNterpret, the OSEMN of the post's tags), but ideally a data scientist should be at home with them all. We describe each of these steps briefly below. Obtain: pointing and clicking does not scale.

Deep thoughts: our next post addresses how one goes about learning these skills, that is: "what does a data science curriculum look like?"
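A tiny Python sketch of what "Obtain: pointing and clicking does not scale" means in practice. The sample text below is made up and stands in for the body of a scripted fetch (for example via urllib.request), so the snippet runs offline:

```python
import csv
import io

# Hypothetical raw export; in a real pipeline this string would come from
# a scripted fetch (urllib.request, curl, an API client) rather than a
# point-and-click download, so the step can be rerun on new data.
raw = "city,ozone\nNYC,41\nLA,36\n"

rows = list(csv.DictReader(io.StringIO(raw)))
ozone = [int(r["ozone"]) for r in rows]
print(rows[0]["city"], ozone)
```

Because the procurement step is code, it can be versioned, scheduled, and repeated on a thousand files as easily as on one.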

Significantly misleading

Author: Mark Kelly

Mark Twain, with characteristic panache, said '...I am dead to adverbs, they cannot excite me'. Stephen King agrees, saying 'The road to hell is paved with adverbs'. The idea, of course, is that if you are using an adverb you have chosen the wrong verb: it is stronger to say 'He shouted' than 'He said loudly'. What, then, are we to make of the ubiquitous 'statistically significantly related'? 'Statistically significant' is a tremendously ugly phrase, but unfortunately that is the least of its shortcomings. Imagine if an environmentalist said that oil contamination was detectable in a sample of water from a protected coral reef; 'detectable' says nothing about whether the contamination is large enough to matter. Likewise, all we mean by a 'statistically significant' difference is that the difference is 'unlikely to be zero', not that it is big or important. 'Statistically discernible' is still 50% adverb, however.
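To make the distinction concrete, here is a small Python simulation (mine, not the essay's): with a large enough sample, a mean difference far too small to care about is still comfortably 'statistically significant', i.e. unlikely to be zero.

```python
import math
import random

random.seed(0)

# A negligible true effect: mean 0.01 (vs. a null of 0), standard deviation 1.
n = 1_000_000
xs = [random.gauss(0.01, 1.0) for _ in range(n)]

xbar = sum(xs) / n
se = 1.0 / math.sqrt(n)           # standard error, treating sigma = 1 as known
zstat = xbar / se
# Two-sided p-value from the normal distribution.
p = 2 * (1 - 0.5 * (1 + math.erf(abs(zstat) / math.sqrt(2))))

print(f"mean difference = {xbar:.4f}, z = {zstat:.1f}, p = {p:.3g}")
```

The difference is 'significant' at any conventional level, yet nobody would call a shift of one hundredth of a standard deviation important.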

The Tube Open Movie by Bassam Kurdali » Updates

Friends! Supporters! Please pardon the radio silence while we've been cranking frenetically to get the movie made. Conducting such an ambitious project with a tiny budget means that we all work on Tube with one hand while keeping the lights on with the other. Our lovely crew is pushing hard to get the trailer ready for release in time for the Siggraph conference next week, which five of Tube's artists (Bassam, Pablo, Hanny, Francesco, and Bing-Run) will take a few days out to attend. To whet the appetite, here are a few render tests from work in progress, as well as a quick look at some of what's been happening. Between inescapable bouts of his trademark rigging, Bassam's screens are filled with a mix of directing, project management, shading tasks, time-lapse animation, pipeline coding, and more. A great group of super-talented artists and interns has joined our local crew, some visiting from abroad and some working online.

Weak statistical standards implicated in scientific irreproducibility

The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, suggests an analysis by statistician Valen Johnson at Texas A&M University in College Station, using an innovative method of his own devising. Johnson compared the strength of two types of tests: frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study. The strength of the results given by these two types of tests had not been compared before, because they ask slightly different types of questions. So Johnson developed a method that makes the results given by the tests (the P value in the frequentist paradigm, and the Bayes factor in the Bayesian paradigm) directly comparable. Johnson then used these uniformly most powerful tests to compare P values to Bayes factors, and found that a P value of 0.05 corresponds to only weak evidence against the null hypothesis. Indeed, as many as 17–25% of findings at that threshold are probably false, Johnson calculates [1].
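Johnson's calibration rests on uniformly most powerful Bayesian tests, which takes more machinery than fits here. As a flavor of what translating a P value into a Bayes factor looks like, here is the older and simpler Sellke–Bayarri–Berger lower bound (a stand-in for illustration, not Johnson's method) in Python:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor in favor
    of the null hypothesis; valid for p < 1/e."""
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.005):
    bf = min_bayes_factor(p)
    # Best-case posterior probability of the null, assuming 50/50 prior odds.
    post = bf / (1 + bf)
    print(f"p = {p}: Bayes factor >= {bf:.3f}, P(null | data) >= {post:.2f}")
```

Even at p = 0.05 the null keeps a posterior probability of at least roughly 0.29 under equal prior odds, which is in the same territory as the 17–25% false-finding estimate quoted above.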

QQ Plots for NY's Ozone Pollution Data

Introduction: Continuing my recent series on exploratory data analysis, today's post focuses on quantile-quantile (Q-Q) plots, which are very useful for assessing how closely a data set fits a particular distribution. I will discuss how Q-Q plots are constructed and use them to assess the distribution of the "Ozone" data from the built-in "airquality" data set in R. Learn how to create a quantile-quantile plot like this one with the R code in the rest of this blog!

What is a Quantile-Quantile Plot? A quantile-quantile plot, or Q-Q plot, is a plot of the sorted quantiles of one data set against the sorted quantiles of another data set. The sample sizes of the two data sets do not have to be equal, and the quantiles can be observed or theoretical.

Constructing Quantile-Quantile Plots to Check Goodness of Fit: The following steps will build a Q-Q plot to check how well a data set fits a particular theoretical distribution.
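The post's own code is in R; as a language-neutral illustration of the construction, here is a minimal Python sketch, with a right-skewed synthetic sample standing in for the Ozone column:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)
# Skewed synthetic sample standing in for airquality$Ozone (also right-skewed).
data = sorted(random.expovariate(1 / 42.1) for _ in range(100))

n = len(data)
# Step 1: plotting positions (i - 0.5) / n for the sorted sample.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
# Step 2: theoretical quantiles from the candidate distribution
# (here a normal fitted by the sample mean and standard deviation).
nd = NormalDist(mean(data), stdev(data))
theo = [nd.inv_cdf(p) for p in probs]
# Step 3: each plotted point is (theoretical quantile, sample quantile);
# systematic curvature away from the line y = x signals a poor fit.
pairs = list(zip(theo, data))
print(pairs[0], pairs[-1])
```

For skewed data like this, the points bend away from the reference line in the tails, which is exactly the diagnostic the post draws for the Ozone data.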

Instrumental Variables

Jan 10, 2014

Instrumental variables are an incredibly powerful tool for dealing with unobserved heterogeneity within the context of regression, but the language used to define them is mind-bending. Typically, you hear something along the lines of "an instrumental variable is a variable that is correlated with x but uncorrelated with the outcome except through x." I like math stats (when I am not getting a grade for it, at least!), so I turned to Google and did several searches, and the only simple simulation I could find was done using Stata.

Overview

Suppose that you have a continuous outcome y with the known mean response function E(y) = b0 + b1*x + b2*c, and further that x and c are correlated with each other. But often we don't observe c in our data. c could be any number of things, such as treatment practices at a hospital or unmeasured differences between patients, but it is in the direct causal path of y and you don't know it. Instead you fit y = b0 + b1*x + e, where e is white noise centered on zero; because the omitted c is correlated with x, it ends up in the error term and biases the estimate of b1.

Simulations

z <- rnorm(1000); x <- xStar + z
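The fragment above is from the post's R simulation (xStar is defined earlier in code the excerpt drops). As a self-contained stand-in, here is a Python re-sketch of the same setup with made-up numbers: the unobserved confounder c is correlated with x, so the naive slope is biased, while the instrument z, which moves y only through x, recovers the true coefficient.

```python
import random

random.seed(7)
n = 100_000

def cov(a, b):
    # Sample covariance; avoids depending on statistics.covariance (Python 3.10+).
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

u = [random.gauss(0, 1) for _ in range(n)]          # shared component
xStar = [random.gauss(0, 1) + ui for ui in u]       # "true" regressor
c = [random.gauss(0, 1) + ui for ui in u]           # unobserved confounder, correlated with xStar
z = [random.gauss(0, 1) for _ in range(n)]          # instrument: independent of c
x = [xs + zi for xs, zi in zip(xStar, z)]           # observed x, shifted by the instrument
y = [1 + xi + ci + random.gauss(0, 0.5)             # true slope on x is 1
     for xi, ci in zip(x, c)]

ols = cov(x, y) / cov(x, x)   # biased: c rides along with x
iv = cov(z, y) / cov(z, x)    # instrumental-variables estimate
print(f"OLS slope = {ols:.2f}, IV slope = {iv:.2f}")
```

With these particular variances the naive slope comes out around 4/3, while the IV estimate sits near the true value of 1.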