Scraping for… by Paul Bradshaw

Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted. Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that allows you to get answers. Scraping for Journalists introduces you to a range of techniques - from very simple ones that are no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you'll see results - but you'll also be building towards more ambitious and powerful tools. You’ll be scraping within 5 minutes of reading the first chapter - but more importantly you'll be learning key principles and techniques for dealing with scraping problems.

Related: PDF Scraping, Collecting and Scraping data, Data journalism, Daily Dals

Scraping for Journalism: A Guide for Collecting Data. Photo by Dan Nguyen/ProPublica. Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer.

PDF2XL FREE Download by CogniView. Find out how you can convert even a 500 page-long PDF document to an Excel spreadsheet in just 5 minutes! Here’s how simple it is: Use PDF2XL to view your PDF document. Select the data you want to convert on one page, and PDF2XL automatically gives you a preview of your selected data in Excel. Choose whether you want to convert the current page, a page range or all the pages and… Click on the CONVERT button and the selected data pastes instantly into your Excel (XLS) or Word (DOC) file. This super-fast PDF to Excel conversion process along with the simple PDF2XL installation allows you to install, set up, and convert your first PDF file in less than 5 minutes. You will, of course, be able to handle smaller documents, even one-pagers. Follow these 3 steps and you will be able to convert any PDF document to Microsoft Excel within minutes. Step 1: Watch the PDF2XL introduction movie

What I Learned Recreating One Chart Using 24 Tools Back in May of this year, I set myself a challenge: I wanted to try as many applications and libraries and programming languages in the field of data visualization as possible. To compare these tools on a level playing field, I recreated the same scatterplot (also called a bubble chart) with all of them. Based on the results, I published two listicles: One for data vis applications and one for data vis libraries and programming languages. An overview of all the tools I tried can be found in this Google Spreadsheet.

The Participatory Panopticon vs. The Pentagon. Digital cameras may have had their Rodney King moment this last week, with the pictures taken of prisoner abuses by American troops in Iraq, sent via email around the world.

Converting PDFs to Usable Data for the International Journalism Festival 2012. Dan Nguyen, twitter: @dancow / @propublica. April 26, 2012. Shortlink: Note: This guide only covers the better known, more useful methods.

Convert PDF to Excel, Word with the PDF Converter - Able2Extract. Open, Select, Convert. Advanced PDF Handling: Need image (scanned) PDF conversion to Excel, Word, and PowerPoint? Able2Extract Professional combines leading edge technology with our proprietary PDF conversion algorithm to deliver high quality conversions every time. This is great for people working with paper documents and wanting to access them electronically. Learn More About Able2Extract Professional. Able2Extract (A2E) is the Ultimate Data Conversion Utility!

References for visualising uncertainty. One of the increasingly frequent questions I get asked, particularly by people from a scientific or financial domain, is how to effectively visualise uncertainty of data and of statistics. My response is usually to make suggestions around annotated markings and/or colour gradients to indicate increasing or declining certainties. I've been gathering bits of evidence for these suggestions and any other sample solutions that might work in different contexts. There aren't that many but I have compiled some references, papers and examples for anyone interested. If any others emerge I will add them to this list, so if you have any suggestions, please let me know: 1.

The Rise of the Participatory Panopticon. This week, I spoke at the first MeshForum conference, held in Chicago. The following is an adaptation of my talk, which adapts some earlier material with some new observations. Fair warning: it's a long piece. I look forward to your comments. The photo at right is by Howard Greenstein, taken during my presentation. Soon -- probably within the next decade, certainly within the next two -- we'll be living in a world where what we see, what we hear, what we experience will be recorded wherever we go.

My Life Log: Scraping PDF's in Python. So, in the course of grabbing some additional data sources for GovCheck, I needed to scrape a few PDFs and insert the information into my database. After looking high and low, I found an acceptable solution to do this using Python - pdfminer. It's not perfect, but it's much better than the rest of the PDF to HTML/TXT converter tools - at least as far as scraping goes. So I figured I'd note here how I wrote my scraping code. As a reference point, I was parsing election data for the past election using this pdf file. You start off with running the code through pdfminer and getting the resulting HTML back.
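What might come after getting the HTML back can be sketched roughly as follows. This is only an illustration, not the post author's code: the sample markup, the `SpanTextParser` class, and the `extract_rows` and `pair_rows` helpers are all hypothetical, and real pdfminer output varies from one document to the next. The sketch assumes the converted HTML wraps each piece of text in a `<span>`, and that names and vote counts alternate.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text found inside each top-level <span> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0     # number of <span> tags currently open
        self._buffer = []   # text pieces of the span being read
        self.rows = []      # finished text chunks, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse line breaks and repeated spaces into single spaces.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.rows.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_rows(html):
    """Return the cleaned text of every top-level <span> in the HTML."""
    parser = SpanTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def pair_rows(rows):
    """Pair alternating name/count entries (an assumed layout)."""
    pairs = []
    for name, count in zip(rows[0::2], rows[1::2]):
        pairs.append((name, int(count.replace(",", ""))))
    return pairs


# Hypothetical stand-in for converter output; not taken from a real PDF.
SAMPLE_HTML = """
<div style="position:absolute; top:100px;">
<span style="font-size:10px">Candidate A
<br></span><span style="font-size:10px">1,234
<br></span>
</div>
<div style="position:absolute; top:120px;">
<span style="font-size:10px">Candidate B
<br></span><span style="font-size:10px">987
<br></span>
</div>
"""

if __name__ == "__main__":
    for name, votes in pair_rows(extract_rows(SAMPLE_HTML)):
        print(name, votes)
```

Only the standard library is used for the parsing step, so the sketch runs without pdfminer installed. In practice the pairing rule would need adjusting to whatever layout the converted file actually has.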

Hub - Find, grab and organize all kinds of data and media from online sources. OutWit Hub Light is free and fully operational, but doesn’t include the automation features and limits the extraction to one or a few hundred rows, depending on the extractor. When purchasing the Pro version, you will receive a key to remove these limitations and unlock all advanced features. The inline help function covers Light and Pro features. Check it out and get acquainted with OutWit Hub at no cost.