
WebScraping


Collecting Flight Data from Bureau of Transportation with Python. This is Part III in our Data Scraping blog post series.

Data Scraping- Part III: Python

Part I by Jewel Loree shows how to scrape websites using IFTTT and Part II by Mike Klaczynski provides an Import.io walk-through. One particularly interesting data set, and one that is not very straightforward to harvest, is the flight data provided by the Bureau of Transportation Statistics, part of the Research and Innovative Technology Administration (RITA). So what is so compelling about this data set?

Well, besides being flight data, it has geographical data and datetime data, and it is very wide, very flat, very large, and very deep (it dates back decades). Fortunately, RITA data comes to us as a ZIP file containing a CSV, which is easy to work with in Tableau. Well, easy enough: it is still a rather large and wide dataset, as mentioned above. We can download the data! Here is a snapshot of what we are presented with. Step 1. Step 2.
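Since the data arrives as a ZIP file containing a CSV, the unpacking step can be sketched in a few lines of Python. This is only a minimal illustration, not the script from the original post: the helper names, the sample file name, and the column names (`FL_DATE`, `ORIGIN`, `DEST`) are made up for the example, and the archive is built in memory so that nothing is downloaded.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes):
    """Return the rows of the first CSV file found in a ZIP archive as dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the archive holds a single CSV; take the first one found.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            # Wrap the binary member so the csv module can read it as text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_sample_zip():
    """Build a tiny in-memory ZIP (hypothetical data) so the sketch runs offline."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "flights.csv",
            "FL_DATE,ORIGIN,DEST\n2014-01-01,SEA,SFO\n2014-01-02,SEA,JFK\n",
        )
    return buffer.getvalue()


rows = read_csv_from_zip(make_sample_zip())
for row in rows:
    print(row["FL_DATE"], row["ORIGIN"], row["DEST"])
```

For a real download, the bytes passed to `read_csv_from_zip` would come from an HTTP response instead of `make_sample_zip`. Because the full dataset is large and wide, iterating over the `csv.DictReader` row by row, rather than building a list, would likely be the safer choice.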

This is Part II in our Data Scraping blog post series.

Data Scraping- Part II: Import.io

Part I by Jewel Loree shows how to scrape websites using IFTTT and Part III by Isaac Obezo provides a walk-through for using Python. Today, as part of our Tableau Public Data Month, we’re focusing on yet another excellent and easy-to-use data scraping tool, Import.io. Unlike IFTTT, which Jewel reviewed earlier this week, Import.io allows you to scrape any website, not just ones that have special connectors, but it does require a few more steps to set up. To showcase the power of Import.io, we will do a step-by-step walkthrough of how to collect data from filmaffinity.com, a movie review site similar to IMDB.

These walkthrough steps are universal and will apply to any site you scrape in the future. Once we scrape the website, we'll have a CSV file with all of the collected data that we can use in Tableau Public to build vizzes and share with the world. Import.io Walkthrough. Next Steps.

This is Part I in our Data Scraping blog post series.

Data Scraping- Part I: IFTTT

Part II by Mike Klaczynski provides an Import.io walk-through and Part III by Isaac Obezo shows how to write scripts for scraping using Python. You may have heard that August is Data Month. All month, we will be providing you with ideas on where to get data and how to use it in Tableau. Earlier, I showed you how to optimize your large data sets for performance and Ben Jones shared a few resources for Open Data. This is the first post in a series on Data Scraping, where we will be sharing a few different methods of collecting data from the web. IFTTT (pronounced like “gift” without the g) stands for “If this, then that.” Once you set up a recipe in IFTTT, it runs in the background for as long as you have it turned on.

Last spring, I used IFTTT to scrape Craigslist posts for tickets to the Sasquatch Music Festival. To get the data, I used the Craigslist Search recipe listed above.

Regular Expressions

Comparisons

Commercial ($)

Free or Open source

Tutorials

Webscraping with Python

Webservices for Scraping (free & $$)