
Scraping for Journalism: A Guide for Collecting Data

Photo by Dan Nguyen/ProPublica
Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data.
Most of the techniques are within the ability of the moderately experienced programmer. The most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly’s disclosure site. Lilly has since released their data in PDF format.
These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out about what it takes to gather data by scraping web sites -- so you know what you’re asking for if you end up hiring someone to do the technical work for you.
The tools
With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.
A Guide to the Guides

http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data

Cleaning data using Google Refine: a quick guide
I’ve been focusing so much on blogging the bells and whistles stuff that Google Refine does that I’ve never actually written about its most simple function: cleaning data. So, here’s what it does and how to do it:
1. Download and install Google Refine if you haven’t already done so. It’s free.
2. Run it – it uses your default browser.
3. In the ‘Create a new project’ window click on ‘Choose file’ and find a spreadsheet you’re working with.

Data Extraction and Web Scraping
A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract this data for you and either re-use the data or store it in a file or database. iMacros can write extracted data to standard text files, including the comma separated value (.csv) format, readable by spreadsheet processing packages. Also, iMacros can make use of the powerful scripting interface to save data directly to databases.
The Extract command

Data Visualization Tutorials
KDMC produces a wealth of digital media tutorials to support our training sessions and classes. While the focus of some tutorials is on technology and journalism, most are general enough to be of use to anyone.
Spreadsheets (updated March 11, 2012, in Data Visualization): This tutorial covers the basics of creating and doing calculations with a spreadsheet.

10 Awesome Free Tools To Make Infographics
Who can resist a colourful, thoughtful Venn diagram anyway? In terms of blogging success, infographics are far more likely to be shared than your average blog post.

Chapter 1. Using Google Refine to Clean Messy Data
Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.
Other reasons why you should try Google Refine:

The ProPublica Nerd Blog
We used WebGL to create the 3-D map of FEMA's new flood zones.
Earlier this year we published a story and an interactive graphic about the evolving Federal Emergency Management Agency flood maps in New York City in the year after Hurricane Sandy. FEMA had advisory maps in the works when Sandy hit. The agency rushed them out in the days afterward as a first sketch for those looking to rebuild. Our story found that while the maps continued to be revised over the course of a year, homeowners had little guidance on how much their home’s value, as well as its required elevation, was changing as they struggled to rebuild after the storm.

Chapter 2: Reading Data from Flash Sites
Flash applications often disallow the direct copying of data from them. But we can instead use the raw data files sent to the web browser.
Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying about how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique that you should use when first examining a database-backed website.
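Once a raw data file has been spotted in the browser's traffic, reading it is usually ordinary text parsing. The following is a minimal sketch in Python, not the guide's own code. It assumes the file is tab-delimited with a header row; the excerpt does not describe the actual format of the Recovery.gov file, and the sample text and the function name `parse_raw_data` are made up for illustration.

```python
import csv
import io


def parse_raw_data(text, delimiter="\t"):
    """Turn the text of a raw data file into a list of dicts, one per row.

    Assumes the first line holds the column names; adjust the delimiter
    or the parsing if the real file is laid out differently.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample standing in for a file found in the browser's traffic.
sample = "state\tamount\nNY\t100\nCA\t250\n"
rows = parse_raw_data(sample)
for row in rows:
    print(row["state"], row["amount"])
```

In practice the `text` argument would be the body of the file saved from the browser or downloaded from the URL seen in the traffic inspector.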

Free Online Data Training
Data visualization basic training; from spreadsheet to data mapping.
kdmcBerkeley is offering four free online training courses in data journalism. You'll learn basic data visualization skills, from spreadsheets to data mapping. Each of the four one-hour long courses builds upon the other; register for all four sessions or choose the session that best meets your needs. Each course is offered twice, once at 10am PST and then again at 1pm PST. Registration is limited to the first 200 participants per session and registration for all four courses is now open.
Spreadsheet Basics (Completed)

Chapter 4: Scraping Data from HTML
Web-scraping is essentially the task of finding out what input a website expects and understanding the format of its response. For example, Recovery.gov takes a user's zip code as input before returning a page showing federal stimulus contracts and grants in the area. This tutorial will teach you how to identify the inputs for a website and how to design a program that automatically sends requests and downloads the resulting web pages.
Pfizer disclosed its doctor payments in March as part of a $2.3 billion settlement - the largest health care fraud settlement in U.S. history - of allegations that it illegally promoted its drugs for unapproved uses. Of the disclosing companies so far, Pfizer's disclosures are the most detailed and its site is well-designed for users looking up individual doctors.
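The Chapter 4 idea above, working out what input a site expects and then sending that input automatically, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint `http://example.com/search` and the parameter name `zip` are hypothetical, since the excerpt does not give the real Recovery.gov or Pfizer request format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, for illustration only.
BASE_URL = "http://example.com/search"


def build_request_url(zip_code, base_url=BASE_URL, param="zip"):
    """Return the URL that asks the site for one zip code's results."""
    return base_url + "?" + urlencode({param: zip_code})


def build_all_urls(zip_codes):
    """Build one request URL per input value."""
    return [build_request_url(z) for z in zip_codes]


urls = build_all_urls(["10001", "94105"])
for url in urls:
    print(url)
```

Each URL could then be fetched with `urllib.request.urlopen` and the response saved to disk, ideally with a pause between requests so the site is not overloaded.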

Journalism and Media Studies Center at the University of Hong Kong, Spring 2013
For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service. You’ll do this by building a text enrichment program, which takes plain text and outputs HTML with links to the detected entities. Then you will take five random articles from your data set, enrich them, and manually count how many entities OpenCalais missed or got wrong.
1. Get an OpenCalais API key, from this page.
2.

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List
Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Update (4/28): Replaced the code and result files.

Periodismo de datos - Grupo de trabajo
What is the work group on data journalism?
Schedule
Registration for the work group
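The text enrichment program described in the Hong Kong assignment above, plain text in and HTML with entity links out, might look roughly like the Python sketch below. It does not call OpenCalais: the `entities` mapping is supplied by hand and stands in for whatever the extraction service returns, and the function name `enrich` and the example URL are hypothetical.

```python
import html
import re


def enrich(text, entities):
    """Return HTML in which each detected entity name is wrapped in a link.

    `entities` maps an entity name to a URL. In the assignment these would
    come from the entity extraction service; here they are hand-written.
    """
    if not entities:
        return html.escape(text)
    # Try longer names first so "New York City" wins over "New York".
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    parts = []
    last = 0
    for m in pattern.finditer(text):
        # Escape the plain text between matches, then emit the link.
        parts.append(html.escape(text[last:m.start()]))
        url = html.escape(entities[m.group(0)], quote=True)
        parts.append('<a href="%s">%s</a>' % (url, html.escape(m.group(0))))
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


sample_entities = {"Pfizer": "http://example.com/pfizer"}
enriched = enrich("Pfizer disclosed payments & fees.", sample_entities)
print(enriched)
```

Matching on exact strings is the simplest choice; a real evaluation would more likely use the character offsets the service reports, if it provides them, so that repeated or overlapping names are linked correctly.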

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully.
UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby. I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think:
Who this post is for

Mind the Map: Toward a Handbook for Journalists
“What is it we want our maps to be now, if no longer a single authoritative view of the world?” - Brooke Gladstone, Host of NPR’s On the Media
Maps are rhetorical devices.
