40 Techniques Used by Data Scientists These techniques cover most of what data scientists and related practitioners use in their daily work, whether they rely on solutions offered by a vendor or design proprietary tools. Each of the 40 entries links to a selection of related articles. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Many of these articles are fundamental to understanding the technique in question, and come with further references and source code. Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics.
R scripts for analyzing survey data Another site pops up with open code for analyzing public survey data. It will be interesting to see whether this gets used by the general public, given the growing trend of data journalism and so forth, or mainly by academics. It is a useful resource for both. Originally posted on the blog The Data Monkey.
IEEE DataPort - IEEE Big Data IEEE DataPort™ is now available for use! Go to ieee-dataport.org to be connected to this valuable one-stop shop data repository serving the growing technical community focused on Big Data! Contact Melissa Handa today at email@example.com for a coupon code to become a subscriber free of charge! Share, Access and Analyze Big Data with IEEE DataPort™!
Handy statistical lexicon These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people: Mister P: Multilevel regression and poststratification. The Secret Weapon: Fitting a statistical model repeatedly on several different datasets and then displaying all these estimates together. The Superplot: Line plot of estimates in an interaction, with circles showing group sizes and a line showing the regression of the aggregate averages. The Folk Theorem: When you have computational problems, often there’s a problem with your model.
Get JSON from Excel using Python, xlrd Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need. These days, most of the apps I help build at Gannett Digital consume JSON.
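The excerpt above names the approach (Excel in, JSON out, via xlrd) but not the steps. What follows is a minimal sketch of one way to do it, not the linked article's code. It assumes the first row of the worksheet holds column names; the function names `rows_to_records`, `read_sheet_rows`, and `excel_to_json` are hypothetical. Note that xlrd 2.0 and later reads only the legacy .xls format, and that it returns numeric cells as floats.

```python
import json


def rows_to_records(rows):
    """Turn a header row plus data rows into a list of dicts (one per data row)."""
    if not rows:
        return []
    header = [str(name).strip() for name in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


def read_sheet_rows(path, sheet_index=0):
    """Read every row of one worksheet as a list of cell values, using xlrd."""
    # Third-party dependency; imported here so rows_to_records works without it.
    import xlrd

    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(sheet_index)
    return [sheet.row_values(i) for i in range(sheet.nrows)]


def excel_to_json(path, sheet_index=0):
    """Convert one worksheet to a JSON array of objects keyed by column name."""
    records = rows_to_records(read_sheet_rows(path, sheet_index))
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    # Made-up rows standing in for a worksheet, so the sketch runs without a file.
    sample = [["name", "votes"], ["Smith", 1200.0], ["Jones", 950.0]]
    print(json.dumps(rows_to_records(sample), indent=2))
```

Keeping the row-to-dict step separate from the file reading means the conversion can be checked without a spreadsheet on disk, and the resulting JSON string can be written out as the flat file an app consumes.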
The Cure for Cancer Is Data—Mountains of Data A few years ago Eric Schadt met a woman who had cancer. It was an aggressive form of colon cancer that had come on quickly and metastasized to her liver. She was a young war widow from Mississippi, the mother of two girls she was raising alone, and she had only the health care that her husband’s death benefits afforded her—an overburdened oncologist at a military hospital, the lowest rung on the health care ladder. The polar opposite of cutting-edge medicine. To walk into such a facility with stage 4 metastatic disease is to walk back in time to the world of the unmapped human genome, when “colon cancer” was understood to have a single cause instead of millions of causes resulting in unique variations, when treatment was the same bag of poison, whether you were in Ocean Springs, Mississippi, or Timbuktu.
Toward sustainable insights, or why polygamy is bad for you Binning et al., CIDR 2017 Buckle up! Today we’re going to be talking about statistics, p-values, and the multiple comparisons problem. For my own benefit, I’ll try and explain what follows as simply as possible – I find it incredibly easy to make mistakes otherwise! Let’s start with a very quick recap of p-values.
Tableau Tip: Embedding dashboards from multiple, disparate workbooks into a single workbook Here’s the situation: you have several people creating their own dashboards in separate workbooks; your boss doesn’t want to open all of the dashboards separately; you need all of these diverse dashboards in a single dashboard. The solution is way easier than you think.
Anscombe's quartet Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the paper in which they appeared as being intended to attack the impression among statisticians that "numerical calculations are exact, but graphs are rough."
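The claim that the four datasets share nearly identical summary statistics is easy to check numerically. The sketch below uses the standard published values of the quartet and only the Python standard library; the helper name `summarize` is my own, and the statistics chosen (means, sample variances, Pearson correlation, least-squares line) are the usual ones quoted for the quartet rather than anything taken from the excerpt.

```python
from math import sqrt
from statistics import mean

# The four datasets; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation, and least-squares fit of y on x."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        stats = summarize(xs, ys)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four rows print almost the same numbers, which is exactly why the summary table alone cannot distinguish a noisy line, a curve, a line with one outlier, and a vertical cluster with one leverage point; only a plot does.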
The Pith of Performance: Extracting the Epidemic Model: Going Beyond Florence Nightingale Part II This is the second of a two-part reexamination of Florence Nightingale's data visualization based on her innovative cam diagrams (my term) shown in Figure 1. Figure 1. Nightingale's original cam diagrams.
The Pith of Performance: Going Beyond Florence Nightingale's Data Diagram: Did Flo Blow It with Wedges? In 2010, I wrote a short blog item about Florence Nightingale the statistician, solely because of its novelty value. I didn't even bother to look closely at the associated graphic she designed, but that's what I intend to do here. In this first installment, I reflect on her famous data visualization by reconstructing it with the modern tools available in R.
Big Data, Data Mining, Predictive Analytics, Statistics, StatSoft Electronic Textbook This free ebook has been provided as a public service since 1995. The Statistics: Methods and Applications textbook offers training in the understanding and application of statistics and data mining. It covers a wide variety of applications, including laboratory research (biomedical, agricultural, etc.), business statistics, credit scoring, forecasting, social science statistics and survey research, data mining, engineering and quality control applications, and many others. The Textbook begins with an overview of the relevant elementary (pivotal) concepts and continues with a more in-depth exploration of specific areas of statistics, organized by "modules", representing classes of analytic techniques. A glossary of statistical terms and a list of references for further study are included.
DRM be damned: How to protect your Amazon e-books from being deleted If you buy e-books from Amazon and want to engage in a bit of digital civil disobedience, by stripping the files’ DRM and making sure that Amazon can’t deny you access, we’re about to show you how. Yes, many parts of the Internet have known about this technique for some time now, but we feel that it bears mentioning again here. Over the past week, the tech world has been abuzz with news that (surprise, surprise) Amazon can remotely wipe any Kindle, at any time, for effectively any reason. (The company did it before, ironically, with George Orwell’s 1984, back in 2009.) This week’s case involves a Norwegian woman named Linn Nygaard.