40 Techniques Used by Data Scientists These techniques cover most of what data scientists and related practitioners are using in their daily activities, whether they use solutions offered by a vendor, or whether they design proprietary tools. When you click on any of the 40 links below, you will find a selection of articles related to the entry in question. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Many of these articles are fundamental to understanding the technique in question, and come with further references and source code. Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics.
R scripts for analyzing survey data Another site pops up with open code for analyzing public survey data: It will be interesting to see whether this gets used by the general public (given the growing trend of data journalism and so forth) versus academics. It is a useful resource for both. To leave a comment for the author, please follow the link and comment on his blog: The Data Monkey.
Data Marketplace It takes a lot of data of all kinds to add up to Big Data. That's why we've assembled this awesome collection of datasets and streams for data scientists and developers to sample, experiment with and use to create awesome analytics and applications. Whether you need to tap into social, geo or other kinds of data we've got just what you need. Tags: age, births, census, character, chemistry, commodities, corpora, death, demographics, economics, employment, football, geonames, government, health, housing, income, language, law, literature, locations, longitude, maps, music, national, pollution, population, science, size-large, social, spending, sports, statistics, survey, twitter, word, zipcode
Handy statistical lexicon These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people: Mister P: Multilevel regression and poststratification. The Secret Weapon: Fitting a statistical model repeatedly on several different datasets and then displaying all these estimates together. The Superplot: Line plot of estimates in an interaction, with circles showing group sizes and a line showing the regression of the aggregate averages. The Folk Theorem: When you have computational problems, often there’s a problem with your model. The Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing. Weakly Informative Priors: What you should be doing when you think you want to use noninformative priors. P-values and U-values: They’re different. Conservatism: In statistics, the desire to use methods that have been used before.
Get JSON from Excel using Python, xlrd | Anthony DeBarros Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need. These days, most of the apps I help build at Gannett Digital consume JSON. Last year I wrote about how to use Python to generate JSON files from a SQL database. The key ingredient is the Python library xlrd. (Another choice is openpyxl, which has similar features and works with newer .xlsx formatted Excel files.) Basic xlrd operations Let’s say we have an Excel workbook containing a small table repeated over three worksheets. Here are some snippets of code, just scratching the surface, to interact with it programmatically: From Excel to JSON Pretty cool stuff. Add each cell to a key/value pair in a dictionary, then add each dictionary to a list.
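The excerpt's code snippets did not survive, so here is a minimal sketch of the last step it describes: pair each cell with its column name in a dictionary, then collect the dictionaries in a list and dump the list as JSON. It is not the author's original code. The function names, the assumption that the first row holds column names, and the sample rows are all hypothetical. The xlrd calls used (`open_workbook`, `sheet_by_index`, `row_values`, `nrows`) are part of that library, but note that xlrd 2.0 and later reads only the older .xls format.

```python
import json
from collections import OrderedDict


def rows_to_records(header, rows):
    """Pair each cell with its column name, producing one dictionary per row."""
    return [OrderedDict(zip(header, row)) for row in rows]


def sheet_to_records(path, sheet_index=0):
    """Read one worksheet with xlrd (assumes row 0 holds the column names)."""
    import xlrd  # third-party; imported here so the rest runs without it

    workbook = xlrd.open_workbook(path)
    sheet = workbook.sheet_by_index(sheet_index)
    header = sheet.row_values(0)
    rows = [sheet.row_values(i) for i in range(1, sheet.nrows)]
    return rows_to_records(header, rows)


# Hypothetical stand-in for a worksheet's contents (not the article's data).
# xlrd returns numeric cells as floats, so the sample uses floats too.
header = ["id", "make", "model", "miles"]
rows = [
    [1.0, "Honda", "Civic", 17000.0],
    [2.0, "Ford", "Focus", 35400.0],
]

records = rows_to_records(header, rows)
json_text = json.dumps(records, indent=2)
print(json_text)
```

With a real workbook, `sheet_to_records("cars.xls")` (a hypothetical file name) would return the same kind of list, ready for `json.dumps`. An ordered dictionary keeps the JSON keys in worksheet column order.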
Toward sustainable insights, or why polygamy is bad for you | the morning paper Binning et al., CIDR 2017 Buckle up! Today we’re going to be talking about statistics, p-values, and the multiple comparisons problem. Some good background resources here are: For my own benefit, I’ll try and explain what follows as simply as possible – I find it incredibly easy to make mistakes otherwise! Let’s start with a very quick recap of p-values. p-values If we observe some variable $X$ and see value $x$, we might wonder “what are the odds of that!” Given a hypothesis $H$ about the underlying distribution, we’d be able to give an answer: the probability of seeing $x$ given that hypothesis, or $P(X = x \mid H)$. Time to move on from dice rolls. Suppose the $X$ we observe is now a measure of correlation between two measured phenomena. Rather than asking about $X$ being exactly equal to some value $x$, we need to ask ‘what are the odds of seeing a value $\geq x$ (or $\leq x$)?’ Suppose we see a suspiciously large value. p-value = $\Pr(X \geq x \mid H)$ (source: wikipedia) Here’s the first thinking trap. Multiple comparisons Take a look at this xkcd cartoon. Let’s take a concrete example.
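To make the multiple comparisons problem concrete, here is a small sketch of the standard arithmetic. It assumes the tests are independent and that every null hypothesis is true, so each test produces a false positive with probability alpha. The function names are illustrative, and the Bonferroni correction shown is a textbook remedy, not necessarily the approach taken in the paper.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / num_tests


alpha = 0.05
for num_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(alpha, num_tests)
    threshold = bonferroni_threshold(alpha, num_tests)
    print(f"{num_tests:>3} tests: P(at least one false positive) = {rate:.3f}, "
          f"Bonferroni per-test threshold = {threshold:.5f}")
```

Under these assumptions, running twenty tests at the 0.05 level gives better than even odds of at least one spurious "discovery", which is why an uncorrected stream of comparisons is hard to trust.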
Tableau Tip: Embedding dashboards from multiple, disparate workbooks into a single workbook Here’s the situation: you have several people creating their own dashboards in separate workbooks; your boss doesn’t want to open all of the dashboards separately; you need all of these diverse dashboards in a single dashboard. The solution is way easier than you think. For this example, I’m going to use these three views: Here’s how it’s done: Step 1 – You have to connect to some kind of data. Data as simple as this will work: Step 2 – Create a new dashboard. Step 3 – Add a web page object to the dashboard. You should now see your dashboard inside the dashboard. Notice also that if you have tabs enabled on the dashboard, you can see all of those tabs in the web page object. Step 4 – Repeat steps 2 & 3 for each of the views in the other workbooks. That’s it!
Majority to minority: the declining U.S. white population | N-IUSSP In this essay we document the demography of the decline of the white population in the United States, a country with a long history of white supremacy. Despite the fact that the U.S. Constitution and the civil rights legislation of the 1960s guaranteed equality to all people irrespective of race or ethnicity, everyone is far from equal in the United States today. On average, whites are far better off economically and educationally and in many other ways than are minority peoples. Donald Trump won the U.S. presidential election by focusing his attention on white people. Let us first note that U.S. federal government agencies use two questions to measure race/ethnicity. First immigrants to the U.S. Whites were not the first people to settle in what is now the United States. The first sizeable stream of immigrants to the U.S. were whites from England. Whites and minorities in the U.S. The nonwhite (minority) population is growing much more rapidly than the white population.
The Pith of Performance: Extracting the Epidemic Model: Going Beyond Florence Nightingale Part II This is the second of a two-part reexamination of Florence Nightingale's data visualization based on her innovative cam diagrams (my term) shown in Figure 1. Figure 1. Nightingale's original cam diagrams Recap In Part I, I showed that FN applied sectoral areas, rather than a pie chart or conventional histogram, to reduce the visual impact of highly variable zymotic disease data from the Crimean War. Figure 2. Although R has some very sophisticated tools for producing almost any kind of data visualization (some of which I used in Part I), a very impressive interactive version of Fig. 1 has been created by the Statistical Laboratory at Cambridge University using Adobe Flash (may need browser plug-in). This kind of visualization requires a lot of work to construct and is perfect as an educational tool but complete overkill for typical data exploration. Back to Bar Charts Using R to create Fig. 3, I've reproduced a static version of the interactive Flash bar chart. Figure 3.
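The recap's point about sectoral areas can be sketched numerically. This is an illustration and not the author's R code: it assumes twelve equal-angle wedges (one per month) and made-up monthly values. Because a sector of angle θ and radius r has area (θ/2)·r², encoding a value as area means the radius grows only with the square root of the value, which is how the wedges damp large swings compared with bar heights.

```python
import math


def wedge_radius(value, num_wedges=12):
    """Radius of an equal-angle wedge whose area equals `value`."""
    angle = 2.0 * math.pi / num_wedges
    return math.sqrt(2.0 * value / angle)


def wedge_area(radius, num_wedges=12):
    """Area of an equal-angle wedge with the given radius."""
    angle = 2.0 * math.pi / num_wedges
    return 0.5 * angle * radius ** 2


# Hypothetical monthly values (not Nightingale's data): each is 4x the previous.
monthly_values = [10.0, 40.0, 160.0]
for value in monthly_values:
    radius = wedge_radius(value)
    print(f"value = {value:6.1f}  radius = {radius:6.3f}  area = {wedge_area(radius):6.1f}")
```

In this sketch a fourfold jump in the value only doubles the wedge radius, whereas a bar chart would show the full fourfold change in bar height.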
Beware of Zombie Statistics … Even When It’s Not Halloween (October 2017) Do women really own less than 2 percent of the world’s land? Do women constitute 70 percent of the world’s poor? Do women provide between 60 percent and 80 percent of the agricultural labor in Africa? Do widely cited statistics like these mean they are backed up with solid research? No, but they are repeated often enough that they have attained the status of official fact. Zombie statistics actually can be their own worst enemy. In her blog, Doss assessed the credibility of the claim that women own less than 2 percent of the world’s land. Women’s land rights are important, Doss says, but flawed data won’t resolve the issue. Another widely cited statistical zombie is that African women supply 60 percent to 80 percent of agricultural labor on the continent. A different zombie awoke when Carly Fiorina, a few months before she entered the GOP presidential campaign, said that “70 percent of the people living in abject poverty are women.”
The Pith of Performance: Going Beyond Florence Nightingale's Data Diagram: Did Flo Blow It with Wedges? In 2010, I wrote a short blog item about Florence Nightingale the statistician, solely because of its novelty value. I didn't even bother to look closely at the associated graphic she designed, but that's what I intend to do here. In this first installment, I reflect on her famous data visualization by reconstructing it with the modern tools available in R. Although Florence Nightingale was not formally trained as a statistician, she apparently had a natural aptitude for mathematical concepts and evidently put a lot of thought into presenting the import of her medical findings in a visual way. Why Wedges? Why did FN bother to construct the data visualization in Figure 1? Today, it is hard for us to fully appreciate how innovative her ideas were at that time, and the resistance with which they were met. The visual message of FN's diagram was essentially this. Figure 2. Cam Diagrams in R require(plotrix) require(zoo) fn <- read.table(".. The main argument does not work. Figure 3. Figure 4.
The consequences of not updating the census for more than a decade in Colombia A person with dark skin may be Black, Afro-Colombian, of African descent, a libre, a renaciente, a palenquero, a moreno, a raizal, or part of the costeñidad (the coastal identity) in Colombia. African heritage and its subsequent mixing are understood in as many ways as there are sensibilities, even if that is hard to explain on paper. The last time Colombians were counted was in the 2005 census carried out by DANE (Departamento Administrativo Nacional de Estadística, the National Administrative Department of Statistics). At that time, a map was drawn in which the Afro population made up just over 10% of the 41 million inhabitants recorded. A decade later, projections exceed 48 million and these communities represent between 18% and 20%, according to data from parallel institutions such as the Universidad del Valle in Cali. “If no one knows how many of us there are, how will public policies be applied, how can we claim our rights?” A third factor affects these communities.
DRM be damned: How to protect your Amazon e-books from being deleted If you buy e-books from Amazon and want to engage in a bit of digital civil disobedience (by stripping the files’ DRM and making sure that Amazon can’t deny you access), we’re about to show you how. Yes, many parts of the Internet have known about this technique for some time now, but we feel that it bears mentioning again here. Over the past week, the tech world has been abuzz with news that (surprise, surprise) Amazon can remotely wipe any Kindle, at any time, for effectively any reason. (The company did it before, ironically, with George Orwell’s 1984, back in 2009.) This week’s case involves a Norwegian woman named Linn Nygaard. "I have not heard anything from Amazon about this, except that I got a very strange phone [call] earlier from someone with a hidden number," Nygaard told Norwegian broadcaster NRK. Many speculated that because she was buying content licensed for the UK from Norway, Nygaard somehow ran afoul of Amazon’s licensing deals.