Scraping · chriso/node.io Wiki Node.io includes a robust framework for scraping data from the web. The primary methods for scraping data are get and getHtml, although there are methods for making any type of request, modifying headers, etc. See the API for a full list of methods.

Scraping for Journalism: A Guide for Collecting Data Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer. The most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly’s disclosure site.

Web scraping Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
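The get-then-parse pattern these tools share (fetch a page, then pull structured data out of its markup) can be sketched with only the Python standard library. This is not node.io's JavaScript API, and the page below is a canned string standing in for a fetched response:

```python
from html.parser import HTMLParser

# Canned HTML standing in for a downloaded page.
PAGE = """<html><body>
<a href="/doctors">Doctor payments</a>
<a href="/companies">Companies</a>
</body></html>"""

class LinkExtractor(HTMLParser):
    """Collects the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(PAGE)
print(parser.links)  # ['/doctors', '/companies']
```

In a real job you would obtain PAGE with urllib.request.urlopen(url).read().decode() and feed it to the same parser.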
The Overview Project » About Overview is an open-source tool to help journalists find stories in large numbers of documents, by automatically sorting them according to topic and providing a fast visualization and reading interface. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read. Overview does at least three things really well: find what you don’t even know to look for; find broad trends or patterns across many documents; and make exhaustive manual reading faster, when all else fails. Search is a wonderful tool when you know what you’re trying to find — and Overview includes advanced search features. It’s less useful when you start with a hunch or an anonymous tip.
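The intuition behind sorting documents by topic (documents that share vocabulary group together) can be shown with a toy bag-of-words similarity measure. This is only a sketch of the idea, not Overview's actual algorithm, and the sample texts are invented:

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts for one document.
    return Counter(text.lower().split())

def norm(vec):
    return math.sqrt(sum(n * n for n in vec.values()))

def cosine(a, b):
    # Cosine similarity between two count vectors (1.0 = identical vocabulary).
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

docs = [
    "pharmaceutical company payments to doctors",
    "drug company payments made to doctors and hospitals",
    "city council approves new bicycle lanes downtown",
]
vecs = [vectorize(d) for d in docs]

# The first two documents share vocabulary, so they score as more similar.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

A real system would weight terms (e.g. TF-IDF) and cluster the resulting vectors, but the grouping principle is the same.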
How Can I Protect My Computers and Data When Someone Else Is Using My Network? Dear Lifehacker, After reading how easy it is for someone else to get onto my Wi-Fi network, and, similarly, thinking about how often I let my friends connect to my wireless network, I want to lock down the rest of my network so people connected to it can't go snooping around my computers—or at least secure my most super secret files and folders. What's the best way to go about this?
Spinn3r Features: Simple to Use, Ready for Mission-Critical Applications You can be up and running with Spinn3r in less than an hour. We ship a standard reference client that integrates directly with your pipeline. If you're running Java, you can get up and running in minutes. If you're using another language, you only need to parse out a few XML files every few seconds.
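Consuming such a feed from another language mostly means walking an XML document. Here is a minimal sketch with Python's standard library; the feed layout below is a simplified, hypothetical example, not Spinn3r's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified feed; Spinn3r's real XML schema differs.
feed = """<items>
  <item><title>Example post</title><link>http://example.com/1</link></item>
  <item><title>Another post</title><link>http://example.com/2</link></item>
</items>"""

root = ET.fromstring(feed)
posts = [(item.findtext("title"), item.findtext("link"))
         for item in root.findall("item")]
for title, link in posts:
    print(title, "->", link)
```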
Chapter 1. Using Google Refine to Clean Messy Data Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent on the most tedious parts of data management. Other reasons why you should try Google Refine:
DataMachine - jwpl - Documentation of the JWPL DataMachine - Java-based Wikipedia Library Learn about the different ways to get JWPL and choose the one that is right for you! (You might want to get the fat jars with built-in dependencies instead of the download package on Google Code.) Download the Wikipedia data from the Wikimedia Download Site. You need 3 files:
[LANGCODE]wiki-[DATE]-pages-articles.xml.bz2 OR [LANGCODE]wiki-[DATE]-pages-meta-current.xml.bz2
[LANGCODE]wiki-[DATE]-pagelinks.sql.gz
[LANGCODE]wiki-[DATE]-categorylinks.sql.gz
Note: If you want to add discussion pages to the database, use [LANGCODE]wiki-[DATE]-pages-meta-current.xml.bz2; otherwise [LANGCODE]wiki-[DATE]-pages-articles.xml.bz2 suffices.
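Since the dump file names follow a fixed pattern, a script can build them from the language code and dump date. A small sketch; the langcode and date values below are arbitrary examples, not a recommendation:

```python
# Builds the three dump file names the DataMachine needs.
# Example placeholder values; substitute your own LANGCODE and DATE.
langcode, date = "en", "20120104"

files = [
    f"{langcode}wiki-{date}-pages-articles.xml.bz2",  # or -pages-meta-current.xml.bz2
    f"{langcode}wiki-{date}-pagelinks.sql.gz",
    f"{langcode}wiki-{date}-categorylinks.sql.gz",
]
for name in files:
    print(name)
```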
75+ Tools for Visualizing your Data, CSS, Flash, jQuery, PHP Most people would agree that the old adage “A picture is worth a thousand words” is also true for web-based solutions. There should be no discussion: charts and graphs are ideal for visualizing data in order to quickly deliver an overview and communicate key messages. Whatever type of data presentation you prefer or suits your data (pie charts, bubble charts, bar graphs, network diagrams, etc.), there are many different options. But how do you get started, and what is technologically possible?

Non-Programmer's Tutorial for Python 3/Print version All example Python source code in this tutorial is granted to the public domain. Therefore you may modify it and relicense it under any license you please. Since you are expected to learn programming, the Creative Commons Attribution-ShareAlike license would require you to keep all programs that are derived from the source code in this tutorial under that license.
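Even before reaching for a charting library, the point that a bar graph delivers an overview at a glance can be made in a few lines of plain Python; the category names and values below are invented sample data:

```python
# Invented sample data, purely for illustration.
data = {"Health": 42, "Transport": 29, "Education": 17}
width = 30  # width of the longest bar, in characters
top = max(data.values())

rows = []
for label, value in sorted(data.items(), key=lambda kv: -kv[1]):
    bar = "#" * round(value / top * width)
    rows.append(f"{label:>10} | {bar} {value}")
print("\n".join(rows))
```

The same sorted-bars layout is what most charting tools render graphically.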
Web-Harvest Project Home Page
1. Welcome screen with quick links
2. Web-Harvest XML editing with auto-completion support (Ctrl + Space)
3. Defining initial variables that are pushed to the Web-Harvest context before execution starts

Chapter 2: Reading Data from Flash Sites Flash applications often disallow the direct copying of data from them, but we can instead use the raw data files sent to the web browser. Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying about how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser's traffic is a basic technique you should use when first examining a database-backed website.
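Once the browser's network inspector reveals the raw file behind such a visualization, parsing it is usually simpler than scraping the rendered page. A sketch using a hypothetical JSON payload; the real Recovery.gov file's format and field names are assumptions here:

```python
import json

# Hypothetical payload of the kind a Flash map fetches alongside the page;
# the actual file's structure and field names may differ.
raw = '{"states": [{"name": "NY", "awarded": 12345}, {"name": "CA", "awarded": 67890}]}'

data = json.loads(raw)
totals = {state["name"]: state["awarded"] for state in data["states"]}
print(totals)
```

In practice you would first download the file, e.g. with urllib.request.urlopen, using the URL found in the browser's network inspector.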
Getting Started with HtmlUnit Introduction The dependencies page lists all the jars that you will need to have in your classpath. The class com.gargoylesoftware.htmlunit.WebClient is the main starting point. This simulates a web browser and will be used to execute all of the tests. Most unit testing will be done within a framework like JUnit so all the examples here will assume that we are using that. In the first sample, we create the web client and have it load the homepage from the HtmlUnit website.
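The first sample the text describes looks roughly like the following JUnit test. It assumes the HtmlUnit jars from the dependencies page are on the classpath, and it requires network access; the title check is deliberately loose, since the exact page title may change:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Assert;
import org.junit.Test;

public class HomePageTest {

    @Test
    public void homePage() throws Exception {
        // WebClient simulates a web browser.
        final WebClient webClient = new WebClient();
        try {
            final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
            // Loose check: the exact title text may change over time.
            Assert.assertTrue(page.getTitleText().contains("HtmlUnit"));
        } finally {
            webClient.closeAllWindows();
        }
    }
}
```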
Print or online? One masterpiece and one screw-up – VisualJournalism THE NEW YORK TIMES just ran an interesting article titled ‘It’s All Connected: A Spectator’s Guide to the Euro Crisis’, with an intro ending ‘The graphic here helps you see the intertwined complexities.’ They also ran an interactive visualization online with the same title, but with an intro ending ‘Here is a visual guide to the crisis’. Pretty much the same stuff – except that I challenge you to understand and gain insight from the online version before reading the print version. The printed version has a lot of text, which leads you through the story and educates you along the way on a highly complex system.