
FiveThirtyEight's data journalism workflow with R. FiveThirtyEight is a data journalism site that uses R extensively for charts, stories, and interactives.

FiveThirtyEight's data journalism workflow with R

We've used R for stories covering: p-hacking in nutrition science; how Uber is affecting New York City taxis; workers in minimum-wage jobs; the frequency of terrorism in Europe; the pitfalls in political polling; and many, many more. R is used in every step of the data journalism process: for cleaning and processing data, for exploratory graphing and statistical analysis, for models deployed in real time, and for creating publishable data visualizations. We also write R code to underpin several of our popular interactives, like the Facebook Primary and our historical Elo ratings of NBA and NFL teams.

HUFFPOLLSTER: When Is It OK To Weight Polls By Past Vote? We take a closer look at the practice of weighting by past vote, as used by several pollsters.

HUFFPOLLSTER: When Is It OK To Weight Polls By Past Vote?

Siena College is taking a closer look at a poll that missed by a mile in the Rochester mayor’s race. And tonight we atone for all things Twitter. This is HuffPollster for Friday, September 13, 2013. SHOULD A SURVEY WEIGHT ON PAST VOTING? - Yesterday’s much discussed article by The New Republic’s Nate Cohn revealed that in their 2012 surveys, the Democratic firm Public Policy Polling asked a question about who respondents voted for in 2008 and took that result into account when weighting by demographics.

Geospatial Mapping with D3. Creating Custom Web Maps. Lynchburg, Virginia: The Most Typical City in America. A new visualization to beautifully explore correlations. An ancient curse haunts data analysis.

A new visualization to beautifully explore correlations

The more variables we use to improve our model, the exponentially more data we need. By focusing on the variables that matter, however, we can avoid underfitting, and the need to collect a huge pile of data points. One way of narrowing input variables is to identify their influence on the output variable. Here correlation helps—if the correlation is strong, then a significant change in the input variable results in an equally strong change in the output variable.
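As a rough illustration of using correlation to narrow down input variables, the sketch below computes the Pearson correlation of each candidate input with the output and ranks the inputs by absolute strength. The variable names (`size`, `age`, `noise`), the toy numbers, and the helper `rank_inputs` are all hypothetical and are not taken from the article; Pearson's coefficient is only one possible measure of correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_inputs(inputs, output):
    """Return (name, r) pairs sorted by absolute correlation with the output."""
    scores = {name: pearson(values, output) for name, values in inputs.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: three candidate inputs and one output variable.
output = [2.0, 4.1, 5.9, 8.2, 10.0]
inputs = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with the output
    "age": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as the output rises
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],  # only loosely related to the output
}

ranking = rank_inputs(inputs, output)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by the absolute value keeps strongly negative correlations, such as `age` here, alongside strongly positive ones, since both carry information about the output.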

Rather than using all available variables, we want to pick input variables strongly correlated to the output variable for our model. There's a catch, though, and it arises when the input variables are strongly correlated among themselves. Intercorrelation is the correlation between explanatory variables.

Introduction to Time Series Forecasting With Python - Machine Learning Mastery. Discover How to Prepare Data and Develop Models to Predict the Future. Time series forecasting is different from other machine learning problems.

Introduction to Time Series Forecasting With Python - Machine Learning Mastery

The key difference is the fixed sequence of observations and the constraints and additional structure this provides. In this mega Ebook, written in the friendly Machine Learning Mastery style that you’re used to, finally cut through the math and specialized methods for time series forecasting. Using clear explanations, standard Python libraries and step-by-step tutorials, you will discover how to load and prepare data, evaluate model skill, and implement forecasting models for time series data. Technical details about the book: PDF format Ebook; 8 parts, 34 chapters, 367 pages; 28 step-by-step tutorial lessons; 3 end-to-end projects; 181 Python (.py) files.

A Course for Visualization in R, Taking You From Beginner to Advanced. It’s the fourth year of running memberships on FlowingData (whoa).

A Course for Visualization in R, Taking You From Beginner to Advanced

With at least one tutorial per month since the beginning, I’ve worked up a pretty good collection, mostly in R. Each tutorial is self-encapsulated.

Mining Twitter data with R, TidyText, and TAGS. One of the best places to get your feet wet with text mining is Twitter data.

Mining Twitter data with R, TidyText, and TAGS

Though not as open as it used to be for developers, the Twitter API makes it incredibly easy to download large swaths of text from its public users, accompanied by substantial metadata. A treasure trove for data miners that is relatively easy to parse. It’s also a great source of data for those studying the distribution of (mis)information via digital media. This is something I’ve been working on a lot lately, both in independent projects and in preparation for my courses on Digital Storytelling, Digital Studies, and The Internet. It’s amazing how much data you can get, and how detailed a picture it can paint about how citizens, voters, and activists find and disseminate information.

Tableau. We’re happy to announce the beta release of TabPy, a new API that enables evaluation of Python code from within a Tableau workbook.


When you use TabPy with Tableau 10.1, you can define calculated fields in Python, thereby leveraging the power of a large number of machine-learning libraries right from your visualizations. This new Python integration in Tableau enables powerful scenarios. For example, it takes only a few lines of Python code to get the sentiment scores for reviews of products sold at an online retailer. Then you can explore the results in many ways in Tableau. You might filter to see just the negative reviews and review their content to understand the reasons behind them. Other common business scenarios include:

Tableau. In writing about visualizing survey data using Tableau, I’ve found that the number one impediment to success is getting the data in the right format.


With Tableau 10 and later, it is, in fact, possible to get your survey data just so without having to invest in new tools and/or engage in a time-consuming, error-prone procedure every time you receive updated survey data. What do I mean by "just so"? When I deal with survey data, there are usually four different elements that need to fit together: the demographic information (e.g., age of respondents, gender, etc.); survey responses in text format; survey responses in numeric format; and metadata that describes the survey data.

Tableau. Back in November, we introduced TabPy (currently in beta), making it possible to use Python scripts in Tableau calculated fields.


When you pair Python’s machine-learning capabilities with the power of Tableau, you can rapidly develop advanced-analytics applications that can aid in various business tasks. Let me show you what I mean with an example. Let’s say I’m trying to identify criminal hotspots in Seattle, my hometown. I’ll use data from the Seattle Police Department showing 911 calls for various types of criminal activity in the past few years. With this data, it is really hard to visually identify patterns given the density of activity and noise in GPS readings.

Advanced analytics with Python and Tableau 10.1 integration. After introducing R capabilities in Tableau 8.1, the new Tableau 10.1 now also comes with support for Python.

Advanced analytics with Python and Tableau 10.1 integration

This is great news, especially for data scientists who use the reports to visualize the results of more sophisticated analytical processes. Such reports can now bring the analytics much closer to the end users, while preserving the given level of user-friendliness. In this post I am using a simple modelling example to describe how exactly the integration of Tableau and Python works. While the R integration used Rserve, and you only needed a running Rserve session to enable a connection from Tableau, the Python integration requires you to install and set up TabPy Server (installation instructions from the Tableau GitHub repository can be found here).
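To give a feel for the kind of Python logic such an integration might run, here is a minimal sketch of a simple modelling step written as a plain function. It assumes that Tableau hands each field to the script as a list of values and expects a list of the same length back; that calling convention, the function name `fit_and_predict`, and the toy columns are illustrative assumptions, not details taken from the post. The least-squares line is only a stand-in for whatever model the post goes on to use.

```python
def fit_and_predict(x_values, y_values):
    """Fit y = intercept + slope * x by least squares and return fitted values.

    Inputs and output are plain lists, mirroring the assumed list-in,
    list-out shape of a script called from a Tableau calculated field.
    """
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in x_values]


# Hypothetical columns standing in for the lists Tableau would pass.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
fitted = fit_and_predict(ad_spend, sales)
print(fitted)
```

In Tableau itself, logic like this would typically be invoked from a calculated field through one of the `SCRIPT_` functions (for example `SCRIPT_REAL`), with the fields supplied as arguments; the exact wiring depends on the TabPy version in use.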

The set-up contains instructions on installing TabPy, installing Python 2.7 with Anaconda and connecting Tableau. Let’s show a few examples now, which will contain: