Scraping · chriso/node.io Wiki Node.io includes a robust framework for scraping data from the web. The primary methods for scraping data are get and getHtml, although there are methods for making any type of request, modifying headers, etc. See the API for a full list of methods. Scraping for Journalism: A Guide for Collecting Data Photo by Dan Nguyen/ProPublica Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer. The most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly’s disclosure site. Web scraping Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
The Overview Project » About Overview is an open-source tool to help journalists find stories in large numbers of documents, by automatically sorting them according to topic and providing a fast visualization and reading interface. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read. Overview does at least three things really well. Find what you don’t even know to look for.Find broad trends or patterns across many documents.Make exhaustive manual reading faster, when all else fails. Search is a wonderful tool when you know what you’re trying to find — and Overview includes advanced search features. It’s less useful when you start with a hunch or an anonymous tip.
How to think like a computer: 5 tips for a data journalism workflow part 3 This is the final part of a series of blog posts. The first explains how using feeds and social bookmarking can make for a quicker data journalism workflow. The second looks at how to anticipate and prevent problems; and how collaboration can improve data work. The final workflow tip is all about efficiency. Computers deal with processes in a logical way, and good programming is often about completing processes in the simplest way possible. If you have any tasks that are repetitive, break them down and work out what patterns might allow you to do them more quickly – or for a computer to do them. Features Ready for Mission Critical Applications Simple to Use You can be up and running with Spinn3r in less than an hour. We ship a standard reference client that integrates directly with your pipeline. If you're running Java, you can get up and running in minutes. If you're using another language, you only need to parse out a few XML files every few seconds.
Chapter 1. Using Google Refine to Clean Messy Data Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management. Other reasons why you should try Google Refine:
DataMachine - jwpl - Documentation of the JWPL DataMachine - Java-based Wikipedia Library Back to overview page. Learn about the different ways to get JWPL and choose the one that is right for you! (You might want to get fatjars with built-in dependencies instead of the download package on Google Code) Download the Wikipedia data from the Wikimedia Download Site You need 3 files: [LANGCODE]wiki-[DATE]-pages-articles.xml.bz2 OR [LANGCODE]wiki-[DATE]-pages-meta-current.xml.bz2 [LANGCODE]wiki-[DATE]-pagelinks.sql.gz [LANGCODE]wiki-[DATE]-categorylinks.sql.gz Note: If you want to add discussion pages to the database, use [LANGCODE]wiki-[DATE]-pages-meta-current.xml.bz2, otherwise [LANGCODE]wiki-[DATE]-pages-articles.xml.bz2 suffices.
75+ Tools for Visualizing your Data, CSS, Flash, jQuery, PHP Most people would agree that the old adage “A picture is worth a thousand words” is also true for web based solutions. There should be no discussion – Charts and Graphs are ideal to visualize data in order to quickly deliver an overview and communicate key messages. Whatever type of data presentation you prefer or suits you data (pie charts, bubble charts, bar graphs, network diagrams etc.), there are many different options but how do you get started and what is technologically possible? Was bombshell singled out because of her looks, pageant queen status? Leave it to a lawyer. You may have read in the news recently the fascinating story of one Kendra McKenzie Gill, the bomb-throwing beauty queen. Miss Gill has had quite the eventful summer.
Web-Harvest Project Home Page 1. Welcome screen with quick links 2. Web-Harvest XML editing with auto-completion support (Ctrl + Space) 3. Defining initial variables that are pushed to the Web-Harvest context before execution starts Chapter 2: Reading Data from Flash Sites Flash applications often disallow the direct copying of data from them. But we can instead use the raw data files sent to the web browser. Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique that you should do when first examining a database-backed website.
Getting Started with HtmlUnit Introduction The dependencies page lists all the jars that you will need to have in your classpath. The class com.gargoylesoftware.htmlunit.WebClient is the main starting point. This simulates a web browser and will be used to execute all of the tests. Most unit testing will be done within a framework like JUnit so all the examples here will assume that we are using that. In the first sample, we create the web client and have it load the homepage from the HtmlUnit website.
Print or online? One masterpiece and one screw-up – VisualJournalism THE NEW YORK TIMES just ran an interesting article titled ‘It’s All Connected: A Spectator’s Guide to the Euro Crisis’ and the intro ending with ‘The graphic here helps you see the intertwined complexities.’ They also ran an interactive visualization online with the same title, but with the intro ending in ‘Here is a visual guide to the crisis’. Pretty much the same stuff – except that I challenge you to understand and gain insight from the online-version: See it here: Before reading the print-version. See it here: The printed version has a lot of text, which leads you through the story and educates you along the way on a highly complex system.