Freebase Parallax provides a new way to browse and explore data in Freebase.
This is part of a four-part series on web-scraping for journalists. As of Apr. 5, 2010, it was published a bit incomplete because I wanted to post a timely solution to the recent Pfizer doctor payments list release, but the code at the bottom of each tutorial should execute properly.
Coding for Journalists 102: Who’s in Jail Now: Collecting info from a county jail site | Dan Nguyen pronounced fast is danwin - A note about privacy: This tutorial uses files that I archived from a real-world jail website.
Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully. | Dan Nguyen pronounced fast is danwin - UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code.
Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List | Dan Nguyen pronounced fast is danwin - Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Web-scraping is essentially the task of finding out what input a website expects and understanding the format of its response.
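The two halves of that task can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint `https://example.com/search`, the parameter names `q` and `page`, and the JSON fields `results`, `name`, and `amount` are all hypothetical, and a canned string stands in for a real server response so that nothing is fetched over the network.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Return the URL a browser would request when a GET form is submitted."""
    return base_url + "?" + urlencode(params)


def parse_response(body):
    """Decode a JSON response body into a list of (name, amount) tuples."""
    payload = json.loads(body)
    return [(row["name"], row["amount"]) for row in payload["results"]]


# Step 1: the input the site expects (hypothetical endpoint and parameters).
url = build_query_url("https://example.com/search", {"q": "smith", "page": 1})

# Step 2: the format of its response (a canned sample, not real data).
sample_body = (
    '{"results": [{"name": "A. Smith", "amount": 1200},'
    ' {"name": "B. Smith", "amount": 300}]}'
)
rows = parse_response(sample_body)
```

In practice the input is usually discovered by watching what the browser sends when a form is submitted, and the response is just as likely to be HTML as JSON, in which case the parsing step changes but the overall shape stays the same.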
Flash applications often disallow the direct copying of data from them.
Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.
tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. Tesseract is probably the most accurate open source OCR engine available.
Photo by Dan Nguyen/ProPublica
By admin on Jan 6, 2008 in Java, Programming
Download Mozenda - The Mozenda application has begun downloading to your computer.
I've got a task that is just screaming for automation. Every week, I have to get a number for each of 36 entities for some metrics I do and that basically consists of counting the 'Y's in a certain column in a table on a company web page.
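A task like this could be approached along the following lines. This is a minimal sketch using only Python's standard library, not a finished solution: the sample table, its column position, and the names `ColumnCollector` and `count_ys` are invented for illustration, and the real page's markup may well differ (nested tables, for instance, are not handled).

```python
from html.parser import HTMLParser


class ColumnCollector(HTMLParser):
    """Collect the text of one <td> column (0-based index) from each table row."""

    def __init__(self, column_index):
        super().__init__()
        self.column_index = column_index
        self.values = []
        self._cell_index = -1
        self._in_target = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1          # new row: reset the cell counter
        elif tag in ("td", "th"):
            self._cell_index += 1
            # Only data cells are collected; header cells are counted but skipped.
            if tag == "td" and self._cell_index == self.column_index:
                self._in_target = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_target:
            self.values.append("".join(self._buffer).strip())
            self._in_target = False

    def handle_data(self, data):
        if self._in_target:
            self._buffer.append(data)


def count_ys(html, column_index):
    """Return how many cells in the given column contain exactly 'Y'."""
    parser = ColumnCollector(column_index)
    parser.feed(html)
    parser.close()
    return sum(1 for value in parser.values if value == "Y")


# Hypothetical sample standing in for one entity's page.
SAMPLE = """
<table>
  <tr><th>Entity</th><th>Done</th></tr>
  <tr><td>Alpha</td><td>Y</td></tr>
  <tr><td>Beta</td><td>N</td></tr>
  <tr><td>Gamma</td><td>Y</td></tr>
</table>
"""
y_count = count_ys(SAMPLE, 1)
```

For the weekly job, the same function would be called once per entity on the HTML fetched for each of the 36 pages, with the results written to a spreadsheet or CSV.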