Scraping for Journalism: A Guide for Collecting Data

Photo by Dan Nguyen/ProPublica
Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer. The most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly’s disclosure site. Lilly has since released their data in PDF format. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out about what it takes to gather data by scraping web sites -- so you know what you’re asking for if you end up hiring someone to do the technical work for you.
The tools
With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.
A Guide to the Guides

http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data

Data Extraction
Data Extraction and Web Scraping
A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract it and either re-use it or store it in a file or database. iMacros can write extracted data to standard text files, including the comma-separated values (.csv) format, which spreadsheet programs can read. iMacros can also use its scripting interface to save data directly to databases.
The Extract command
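The extract-then-store idea described above can be sketched outside of iMacros. The following is a minimal Python illustration, not iMacros code and not its Extract command: the class `TableExtractor`, the function `table_to_csv`, and the sample HTML are all hypothetical names and data made up for this example. It pulls the cell text out of an HTML table and writes it as CSV using only the standard library.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample page standing in for a scraped price list.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget, large</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

The `csv` module quotes any cell that contains a comma, so the resulting file should open cleanly in a spreadsheet program.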

More free Web tools
It’s quite possible to find something useful and free online every day. Here are a few sites that might come in handy when you’re looking to send off big files, you need audio or images that won’t get you in copyright trouble or you’re looking to build a portfolio site quickly.
1. WeTransfer.com – a free Web-based service for transferring up to 2GB of files to up to 20 people at once.

Chapter 1. Using Google Refine to Clean Messy Data
Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.
Other reasons why you should try Google Refine:

5 Tools for Online Journalism, Exploration and Visualization - ReadWriteCloud
In our last post on data journalism, we ran across a number of tools that would be helpful for anyone who is interested in how to make sense of data. The tools represent a renaissance in how we make sense of our information culture. They provide context and meaning to the often baffling world of big data. This is a snapshot of what is available. We are relying on the work done by Paul Bradshaw, whose blog is an excellent source about the new world of data journalism.
Factual

Chapter 2: Reading Data from Flash Sites
Flash applications often disallow the direct copying of data from them. But we can instead use the raw data files sent to the web browser.
Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique that you should use when first examining a database-backed website.
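Once a raw data file has been spotted in the browser traffic, reading it is usually a small job. The sketch below is a Python illustration under stated assumptions: the excerpt does not say what format the Recovery.gov text file uses, so the hypothetical helper `parse_raw_data` simply tries JSON first and otherwise treats the text as delimited rows with a header line. The sample strings in the usage section are invented.

```python
import csv
import io
import json


def parse_raw_data(text, delimiter=","):
    """Return a list of records from the text of a raw data file.

    Assumption: the file is either JSON or delimited text whose first
    line holds the column names. Real files may need other handling.
    """
    try:
        data = json.loads(text)
    except ValueError:
        # Not valid JSON: fall back to delimited text with a header row.
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        return [dict(row) for row in reader]
    # Wrap a single JSON object so callers always get a list.
    return data if isinstance(data, list) else [data]


if __name__ == "__main__":
    # Invented samples standing in for files saved from browser traffic.
    print(parse_raw_data('[{"state": "NY", "amount": 100}]'))
    print(parse_raw_data("state|amount\nNY|100\n", delimiter="|"))
```

Note that values read from delimited text come back as strings, so numeric columns would still need converting before any arithmetic.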

20 Essential Infographics & Data Visualization Blogs
In the tradition of Inspired Mag’s huge lists, here goes a new one – all the blogs with cool data visualization eye candy in the same place! Enjoy and leave some comments with suggestions, questions and so on.
Information is Beautiful

Chapter 4: Scraping Data from HTML
Web-scraping is essentially the task of finding out what input a website expects and understanding the format of its response. For example, Recovery.gov takes a user's zip code as input before returning a page showing federal stimulus contracts and grants in the area. This tutorial will teach you how to identify the inputs for a website and how to design a program that automatically sends requests and downloads the resulting web pages.
Pfizer disclosed its doctor payments in March as part of a $2.3 billion settlement - the largest health care fraud settlement in U.S. history - of allegations that it illegally promoted its drugs for unapproved uses. Of the disclosing companies so far, Pfizer's disclosures are the most detailed and its site is well-designed for users looking up individual doctors.
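The input-and-response pattern described for Chapter 4 (send one request per input, save each resulting page) can be sketched in a few lines of Python. This is an illustration only, not the guide's own code: the endpoint `https://example.com/search`, the parameter name `zip`, and the functions `build_url`, `fetch_pages` and `fake_opener` are all hypothetical. A real site's URL and parameter names would have to be found by inspecting its search form or browser traffic.

```python
import io
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint; a real site's URL must be discovered by inspection.
BASE_URL = "https://example.com/search"


def build_url(zip_code, base_url=BASE_URL, param="zip"):
    """Build the request URL for one input value (here, a zip code)."""
    return base_url + "?" + urllib.parse.urlencode({param: zip_code})


def fetch_pages(zip_codes, opener=urllib.request.urlopen, delay=1.0):
    """Request one page per zip code and return {zip_code: page_text}.

    `opener` is any callable that takes a URL and returns a readable,
    closable response; it defaults to urllib's urlopen. `delay` is a
    pause in seconds between requests, to avoid hammering the server.
    """
    pages = {}
    for zip_code in zip_codes:
        with opener(build_url(zip_code)) as response:
            pages[zip_code] = response.read().decode("utf-8", errors="replace")
        time.sleep(delay)
    return pages


def fake_opener(url):
    """Offline stand-in for urlopen, so the sketch runs without a network."""
    return io.BytesIO(("page for " + url).encode("utf-8"))


if __name__ == "__main__":
    print(fetch_pages(["10001", "60601"], opener=fake_opener, delay=0))
```

Passing the opener in as a parameter keeps the request logic separate from the network, so the loop can be tried out with `fake_opener` before pointing it at a live site.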

14 examples of data visualization on the web
Trend spotting
A series of websites use APIs and scrape pages to spot and analyze trends:
Fan page analytics – Facebook fan page analytics
Zoofs – Most talked about YouTube videos on Twitter
Fflick – Most tweeted movie titles

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List
Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Update (4/28): Replaced the code and result files.

Data Visualization Platform, Weave, Now Open Source
With more and more civic data becoming available and accessible, the challenge grows for policy makers and citizens to leverage that data for better decision-making. It is often difficult to understand context and perform analysis. “Weave”, however, helps. A web-based data visualization tool, Weave enables users to explore, analyze, visualize and disseminate data online from any location at any time. We saw tremendous potential in the platform and have been helping open-source the software, advising on community engagement strategy and licensing. This week, we were excited to see the soft launch of the Weave 1.0 Beta, which went open-source on Wednesday, June 15.

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.
UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby. I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:
Who this post is for

TimelineSetter: Easy Timelines From Spreadsheets, Now Open to All
Talking Points Memo used TimelineSetter to create a timeline featuring events in Wisconsin’s public-sector union struggle.
Last week we announced TimelineSetter, our new tool for creating beautiful interactive HTML timelines. Today, after a short private beta with some of our fellow news application developers, we’re opening the code to everyone.
How to Install
If you’ve got Ruby and Rubygems installed, you can get the package by running:

Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site
A note about privacy: This tutorial uses files that I archived from a real-world jail website. Though booking records are public record, I make no claims about the legal proceedings involving the inmates who happened to be in jail when I took my snapshot. For all I know, they could have all been wrongfully arrested and therefore don’t deserve to have their name attached in online perpetuity to erroneous charges (even if the site only purports to record who was arrested and when, and not any legal conclusions). For that reason, I’ve redacted the last names of the inmates and randomized their birthdates.
This is where the web-scraping you learned in my last tutorial gets useful.

How to: get started in data journalism using Google Fusion Tables
An intensity map showing the population density for different ethnic groups in Texas
What is it?
Google Fusion Tables allows users to create data visualisations such as maps, charts, graphs and timelines. You can see five great examples of data journalism using Google Fusion Tables here. "Google Fusion is easy", claimed James Ball, data journalist from the Guardian investigations team and former chief data analyst for the Bureau of Investigative Journalism, during a recent talk. "You would say that", I thought.