Blog Archive » A Taxonomy of Data Science

Posted: September 25th, 2010 | Author: Hilary Mason | Filed under: Philosophy of Data | Tags: data, data science, osemn, taxonomy

Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science? We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to ‘look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy (we call it the Snice* taxonomy) of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum). We describe each one of these steps briefly below:

Obtain: pointing and clicking does not scale.

Comment on Sheila's Blog

One of the feedback posts recommended by our facilitator is a blog post by Sheila. I haven't read through her post yet, but I understand from Mark's comment that one of Sheila's points was about adapting to the MOOC, which is what everyone is experiencing even though we're already at the end of Week 2! Well, welcome to the club! ;D So here's my comment (which is, at this moment, still waiting for approval by Sheila): I agree with Mark. In fact, I find it more comfortable to read others' reviews of the texts first, and then go into the texts to 'understand the gist' myself and compare it with the reviews. One thing is for sure: what catches our eye might give us a different angle of understanding from others'. - Shazz, Kuala Lumpur

Sincerely, a learner, - Shazz @ LAK

Significantly misleading

Author: Mark Kelly

Mark Twain, with characteristic panache, said ‘…I am dead to adverbs, they cannot excite me’. Stephen King agrees, saying ‘The road to hell is paved with adverbs’. The idea, of course, is that if you are using an adverb you have chosen the wrong verb. It is stronger to say ‘He shouted’ than it is to say ‘He said loudly’. What are we to make, then, of the ubiquitous ‘statistically significantly related’? ‘Statistically significant’ is a tremendously ugly phrase, but unfortunately that is the least of its shortcomings. Imagine if an environmentalist said that oil contamination was detectable in a sample of water from a protected coral reef. What we mean by a ‘statistically significant’ difference is that the difference is ‘unlikely to be zero’. ‘Statistically discernible’ is still 50% adverb, however.
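The point that a ‘statistically significant’ difference is only one that is ‘unlikely to be zero’ can be illustrated with a small sketch. The numbers below are invented for illustration and do not come from the article. The function name `difference_summary` is hypothetical, and the calculation assumes a plain normal approximation with equal group sizes and a shared standard deviation.

```python
import math


def difference_summary(mean_a, mean_b, sd, n_per_group):
    """Normal-approximation summary for a difference between two group means.

    Assumes both groups share the same standard deviation and sample size
    (a simplifying assumption made only for this sketch).
    """
    diff = mean_b - mean_a
    # Standard error of the difference between two independent means.
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # Two-sided p-value under the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    # Approximate 95% confidence interval for the difference.
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, z, p_two_sided, ci


# Illustrative (made-up) inputs: a difference of a tenth of a standard
# deviation, measured on 10,000 observations per group.
diff, z, p, (lo, hi) = difference_summary(0.0, 0.1, 1.0, 10_000)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.2e}")
print(f"95% interval for the difference: ({lo:.3f}, {hi:.3f})")
```

With a sample this large, the interval excludes zero by a wide margin even though the difference itself is small. The test says the difference is detectable, not that it matters.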

How do I become a data scientist

The Tube Open Movie by Bassam Kurdali » Updates

Friends! Supporters! Please pardon the radio silence while we've been cranking frenetically to get the movie made. Conducting such an ambitious project with a tiny budget means that we all work on Tube with one hand while also keeping the lights on with the other. Our lovely crew is pushing hard to ready the trailer for release in time for the Siggraph conference next week, which five of Tube's artists (Bassam, Pablo, Hanny, Francesco, and Bing-Run) will take a few days out to attend. We look forward to seeing some of you there! To whet the appetite, here are a few render tests from the work that's been in progress, as well as a fast look at some of what's been happening: Between inescapable bouts of his trademark rigging, Bassam's screens are full of a mix of directing, project management, shading tasks, time-lapse animation, pipeline coding, and more. A great group of super-talented artists and interns have joined our local crew, both visiting from abroad and online.

Learning and Knowledge Analytics

Weak statistical standards implicated in scientific irreproducibility

The plague of non-reproducibility in science may be mostly due to scientists’ use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University in College Station. Johnson compared the strength of two types of tests: frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study. The strength of the results given by these two types of tests had not been compared before, because they ask slightly different types of questions. So Johnson developed a method that makes the results given by the tests (the P value in the frequentist paradigm, and the Bayes factor in the Bayesian paradigm) directly comparable. Johnson then used these uniformly most powerful tests to compare P values to Bayes factors. Indeed, as many as 17–25% of such findings are probably false, Johnson calculates.
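To get a rough feel for how a P value and a Bayes factor can be put side by side, here is a minimal sketch. It does not reproduce Johnson's uniformly most powerful Bayesian test construction. It uses a simpler textbook bound and assumes a normally distributed test statistic z with known variance. In that setting, the largest Bayes factor any single point alternative can reach against the null is exp(z²/2). The function names are hypothetical.

```python
import math


def one_sided_p_value(z):
    """Probability of a standard normal statistic at least as large as z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def max_bayes_factor(z):
    """Upper bound on the Bayes factor in favour of a point alternative.

    For a z statistic, the likelihood ratio of N(mu, 1) against N(0, 1)
    is exp(z*mu - mu**2 / 2), which is largest at mu = z, giving
    exp(z**2 / 2). No single point alternative can do better.
    """
    return math.exp(z * z / 2.0)


# Illustrative z values (assumed, roughly one-sided P of 0.05, 0.01, 0.001).
for z in (1.645, 2.326, 3.090):
    p = one_sided_p_value(z)
    bf = max_bayes_factor(z)
    print(f"z = {z:.3f}  one-sided P = {p:.4f}  max Bayes factor = {bf:.1f}")
```

Under these assumptions, a one-sided P value near 0.05 corresponds to a Bayes factor below 4 even in the most favourable case. That is one way to see why a P value threshold can look weaker when read as evidence for a hypothesis.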

Data Mining, Predictive Modeling, Techniques

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications.

Stage 1: Exploration.
Stage 2: Model building and validation.
Stage 3: Deployment.

The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. For information on Data Mining techniques, review the summary topics included below.
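The three stages (exploration, model building and validation, deployment) can be sketched on a toy problem. Everything below is an assumption made for illustration: the data are synthetic, the model is a one-variable least-squares line, and the function names are hypothetical rather than taken from any data mining product.

```python
import random


def explore(rows):
    """Stage 1: summarize the data and measure how x and y move together."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / n
    return {"mean_x": mean_x, "mean_y": mean_y, "cov": cov, "var_x": var_x}


def build_model(train):
    """Stage 2a: fit y = intercept + slope * x by least squares."""
    s = explore(train)
    slope = s["cov"] / s["var_x"]
    intercept = s["mean_y"] - slope * s["mean_x"]
    return intercept, slope


def validate(model, holdout):
    """Stage 2b: mean absolute error on data the model has not seen."""
    intercept, slope = model
    errors = [abs(y - (intercept + slope * x)) for x, y in holdout]
    return sum(errors) / len(errors)


def deploy(model):
    """Stage 3: wrap the fitted model as a function for new cases."""
    intercept, slope = model
    return lambda x: intercept + slope * x


# Synthetic data (assumed): y = 3 + 2x plus Gaussian noise.
rng = random.Random(0)
data = [(float(x), 3.0 + 2.0 * x + rng.gauss(0.0, 0.5)) for x in range(200)]
rng.shuffle(data)
train, holdout = data[:150], data[150:]

model = build_model(train)
mae = validate(model, holdout)
predict = deploy(model)
print(f"intercept = {model[0]:.2f}, slope = {model[1]:.2f}, holdout MAE = {mae:.2f}")
print(f"prediction for x = 10: {predict(10.0):.2f}")
```

The holdout split is what "applying the detected patterns to new subsets of data" amounts to here: the pattern is found on one part of the data and checked on another before it is used for prediction.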

Tube – Epic Production Notes | 3D animated filmmaking in free software and the commons