Scraping for Journalism: A Guide for Collecting Data

Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We've written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of a moderately experienced programmer; the most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly's disclosure site, and Lilly has since released its data in PDF format. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice with no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites, so you know what you're asking for if you end up hiring someone to do the technical work for you.

The tools
With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.

A Guide to the Guides:

http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data


Converting PDFs to Usable Data
For the International Journalism Festival 2012. Dan Nguyen, twitter: @dancow / @propublica. April 26, 2012. Note: this guide covers only the better-known, more useful methods; there are dozens of programs and websites you'll find if you do a Google search for "convert PDF to Excel".

Scraping for Journalists, by Paul Bradshaw
Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted. Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn't have, and to put it into a form that lets you get answers. Scraping for Journalists introduces you to a range of scraping techniques, from very simple ones that are no more complicated than a spreadsheet formula to more complex challenges such as scraping databases or hundreds of documents. At every stage you'll see results, but you'll also be building towards more ambitious and powerful tools.
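To make "no more complicated than a spreadsheet formula" concrete, here is a minimal sketch of the simplest kind of scrape in Python: fetch a page and print every row of its first HTML table. The URL, and the assumption that the page contains a plain <table>, are placeholders for illustration, not examples taken from the book.

import requests
from bs4 import BeautifulSoup

# Hypothetical page holding a single HTML table of records.
html = requests.get('http://example.com/payments.html').text
soup = BeautifulSoup(html, 'html.parser')

# Print each table row as a list of its cell values.
for row in soup.find('table').find_all('tr'):
    print([cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])])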

KDMC Tutorials
KDMC produces a wealth of digital media tutorials to support our training sessions and classes. While the focus of some tutorials is on technology and journalism, most are general enough to be of use to anyone.

Spreadsheets
Updated March 11, 2012, in Data Visualization. This tutorial covers the basics of creating and doing calculations with a spreadsheet.

Protovis
Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction. Protovis is free and open-source, provided under the BSD License.

Cleaning data using Google Refine: a quick guide
I've been focusing so much on blogging about the bells-and-whistles features of Google Refine that I've never actually written about its most basic function: cleaning data. So, here's what it does and how to do it:
1. Download and install Google Refine if you haven't already done so. It's free.
2. Run it - it uses your default browser.
3. In the 'Create a new project' window, click on 'Choose file' and find a spreadsheet you're working with.
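Refine itself is a point-and-click tool, but a rough Python sketch gives a sense of the basic cleanup it automates: trimming stray whitespace and collapsing near-duplicate spellings, much as Refine's common transforms and clustering do. The file donors.csv and its 'name' column are hypothetical.

import csv

# Hypothetical input: a spreadsheet exported as CSV with a 'name' column.
with open('donors.csv') as f:
    rows = [{k: v.strip() for k, v in row.items()} for row in csv.DictReader(f)]

# Crude stand-in for Refine's clustering: treat values that match
# case-insensitively as the same, keeping the first spelling seen.
canonical = {}
for row in rows:
    key = row['name'].lower()
    canonical.setdefault(key, row['name'])
    row['name'] = canonical[key]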

My Life Log: Scraping PDFs in Python
So, in the course of grabbing some additional data sources for GovCheck, I needed to scrape a few PDFs and insert the information into my database. After looking high and low, I found an acceptable solution using Python: pdfminer. It's not perfect, but it's much better than the rest of the PDF-to-HTML/text converter tools, at least as far as scraping goes. So I figured I'd note here how I wrote my scraping code. As a reference point, I was parsing election data for the past election using this PDF file. You start off by running the file through pdfminer and getting the resulting HTML back:

import os
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 era

# Convert each PDF page to HTML with pdfminer's pdf2txt.py, then parse it.
for page in range(9, 552):
    soup = BeautifulSoup(os.popen('python ~/dev/pdfminer-dist-20090330/pdflib/pdf2txt.py -w -p %d Vol_II_LS_2004.pdf' % page).read())
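That 2009 pdfminer distribution is long obsolete; its successor, pdfminer.six, wraps the same page-by-page idea in a high-level function. A minimal sketch of the loop in the modern API, reusing the filename from the example above (note that page_numbers takes zero-based indices, unlike pdf2txt.py's -p flag):

from pdfminer.high_level import extract_text

# Extract each page's text in turn, mirroring the pdf2txt.py loop above.
for page in range(9, 552):
    print(extract_text('Vol_II_LS_2004.pdf', page_numbers=[page]))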

PDF2XL FREE Download by CogniView
Find out how you can convert even a 500-page PDF document to an Excel spreadsheet in just five minutes. Here's how simple it is: use PDF2XL to view your PDF document.

The ProPublica Nerd Blog
We used WebGL to create the 3-D map of FEMA's new flood zones. Earlier this year we published a story and an interactive graphic about the evolving Federal Emergency Management Agency flood maps in New York City in the year after Hurricane Sandy. FEMA had advisory maps in the works when Sandy hit, and the agency rushed them out in the days afterward as a first sketch for those looking to rebuild. Our story found that while the maps continued to be revised over the course of a year, homeowners had little guidance on how much their home's value, as well as its required elevation, was changing as they struggled to rebuild after the storm.

Visual Fusion - Software Components
Visual Fusion is a platform for building map-based solutions that organizations can use to increase strategic insight and make informed decisions: map-based visualizations for everyone, ready in days, not months.

10 Awesome Free Tools To Make Infographics
Who can resist a colourful, thoughtful Venn diagram anyway? In terms of blogging success, infographics are far more likely to be shared than your average blog post.

Scraping PDFs: now 26% less unpleasant with ScraperWiki
Scraping PDFs is a bit like cleaning drains with your teeth. It's slow, unpleasant, and you can't help but feel you're using the wrong tools for the job. Coders try to avoid scraping PDFs if there's any other option. But sometimes there isn't: the data you need is locked up inside inaccessible PDF files. So I'm pleased to present the PDF to HTML Preview, a tool written by ScraperWiki's Julian Todd to ease the pain of scraping PDFs. Just enter the URL of your PDF to see a preview in the browser.
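For a rough local equivalent of that paste-in-a-URL workflow, here is a sketch using requests and pdfminer.six rather than ScraperWiki's own tool; the URL is a placeholder, and this extracts plain text rather than the HTML preview the ScraperWiki tool renders.

import io

import requests
from pdfminer.high_level import extract_text

# Fetch the PDF over HTTP and extract its text without saving it to disk.
pdf_bytes = requests.get('http://example.com/report.pdf').content
print(extract_text(io.BytesIO(pdf_bytes)))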

Convert PDF to Excel and Word with Able2Extract
Open, select, convert. Need image (scanned) PDF conversion to Excel, Word, and PowerPoint? Able2Extract Professional combines leading-edge technology with a proprietary PDF conversion algorithm to deliver high-quality conversions every time.

Free Online Data Training
Data visualization basic training, from spreadsheets to data mapping. kdmcBerkeley is offering four free online training courses in data journalism. You'll learn basic data visualization skills, from spreadsheets to data mapping. Each of the four one-hour courses builds on the last; register for all four sessions or choose the one that best meets your needs. Each course is offered twice, once at 10am PST and again at 1pm PST. Registration is limited to the first 200 participants per session, and registration for all four courses is now open. Spreadsheet Basics (completed).

Wrangler
UPDATE: The Stanford/Berkeley Wrangler research project is complete, and the software is no longer actively supported. Instead, we have started a commercial venture, Trifacta. For the most recent version of the tool, see the free Trifacta Wrangler. Why wrangle? Too much time is spent manipulating data just to get analysis and visualization tools to read it. Wrangler is designed to accelerate this process: spend less time fighting with your data and more time learning from it.
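As a taste of the kind of reshaping Wrangler automated, here is a small Python sketch that unpivots a crosstab (one column per year) into tidy one-value-per-row records. The input file and its layout are hypothetical.

import csv

# Hypothetical crosstab with a header like ['state', '2010', '2011', '2012'].
with open('crosstab.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        # Emit one (state, year, value) record per year column.
        for year, value in zip(header[1:], row[1:]):
            print([row[0], year, value])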
