
Scrap Data
freebase-parallax - New way to browse and explore data in Freebase
This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly.
Coding for Journalists 103: Who’s been in jail before: Cross-checking the jail log with the court system; Use Ruby’s mechanize to fill out a form | Dan Nguyen pronounced fast is danwin
Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site | Dan Nguyen pronounced fast is danwin
Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully. | Dan Nguyen pronounced fast is danwin
UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code.
Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List | Dan Nguyen pronounced fast is danwin
Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Chapter 4: Scraping Data from HTML
Chapter 2: Reading Data from Flash Sites
Google Refine [2] (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.
Chapter 1. Using Google Refine to Clean Messy Data
tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
From iMacros
Data Extraction
By admin on Jan 6, 2008 in Java, Programming
Java - Writing a Web Page Scraper or Web Data Extraction Tool
Download Mozenda The Mozenda application has begun downloading to your computer.

