Comsaint

API

How to extract data from a PDF

We live in a world where PDF is king. Perhaps we could even go as far as to call it the tyranny of the PDF. Developed in the early 90s as a way to share documents among computers running incompatible software, the Portable Document Format (PDF) offers a consistent appearance on all devices, ensuring content control and making it difficult for others to copy the information contained within. However, for a data journalist whose job depends on being able to extract bulk data for analysis and visualisation, PDFs as the filetype of choice do not tend to go down well.

In a field of journalism where the spreadsheet rules the roost, we explore a few ways of turning data enclosed within PDFs into spreadsheets (Excel XLS or CSV), primed for analysis. What’s always important to remember in trying to get data out of PDF files is that there is no single catch-all way that works for every occasion; sometimes it’s just a matter of trying each one until you find the one that works. Here’s how:
