Coding for Journalists 101 : A four-part series
Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles, with far more practical and up-to-date examples and projects than what you'll find here. I'm only keeping this old walkthrough up as a historical reference. So check it out: The Bastards Book of Ruby. -Dan Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, describing what I did for our Dollars for Docs project. So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals that he/she could write a web scraper to collect data from public websites. As the tutorials are aimed at people who aren't experienced in programming, the code is pretty verbose, pedantic, and in some cases a little inefficient.

An Introduction to Compassionate Screen Scraping Screen scraping is the art of programmatically extracting data from websites. If you think it's useful: it is. If you think it's difficult: it isn't. We're going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible. Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screenscraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Now, armed with those guidelines, let's get started screen scraping. Setup Libraries First we need to install the httplib2 and BeautifulSoup libraries:

sudo easy_install BeautifulSoup
sudo easy_install httplib2

If you don't have easy_install installed, then you'll need to download them from their project pages at httplib2 and BeautifulSoup. Choosing a Scraping Target
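To give a taste of what the tutorial builds with those two libraries, here is a minimal sketch (not from the original article, and era-appropriate Python 2): fetch a page through httplib2's on-disk cache, in keeping with the "cache fervently" rule, and pull links out with BeautifulSoup. The URL is a placeholder.

```python
# Minimal sketch, assuming Python 2 and the BeautifulSoup 3 that
# `easy_install BeautifulSoup` installs (with bs4 the import would be
# `from bs4 import BeautifulSoup`).
import httplib2
from BeautifulSoup import BeautifulSoup

# Giving Http() a directory name turns on on-disk caching, so repeat
# runs re-read pages from '.cache' instead of hammering the server.
http = httplib2.Http('.cache')
response, content = http.request('http://example.com/page.html', 'GET')

soup = BeautifulSoup(content)
# Print every link's target and text, as a stand-in for real extraction.
for link in soup.findAll('a', href=True):
    text = ''.join(link.findAll(text=True)).strip()
    print link['href'], text
```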

Data journalism pt1: Finding data (draft – comments invited) The following is a draft from a book about online journalism that I've been working on. I'd really appreciate any additions or comments you can make – particularly around sources of data and legal considerations. The first stage in data journalism is sourcing the data itself. Often you will be seeking out data based on a particular question or hypothesis (for a good guide to forming a journalistic hypothesis see Mark Hunter's free ebook Story-Based Inquiry (2010)). On other occasions, it may be that the release or discovery of data itself kicks off your investigation. There is a range of sources available to the data journalist, both online and offline, public and hidden: national and local government; bodies that monitor organisations (such as regulators or consumer bodies); scientific and academic institutions; health organisations; charities and pressure groups; business; and the media itself. Sections cover private companies and charities; regulators, researchers and the media; and live data.

Masterclass 20: Getting started in data journalism Are you confused about what data journalism is, how you do it, and what its purpose is? If so, join the club. There is a mystique surrounding data journalism; it's almost as if it's a dark art and you have to be a wizard to practise it. A very few people are brilliant at it, a number have dabbled in it, loads of journalists think they probably ought to find out about it, but most fear they probably won't be able to master it. All this throws up a smokescreen about the subject that I hope to dispel in this masterclass. What data journalism is I aim to show what data journalism is, what it can do, and how to do it.

Scraping for Journalism: A Guide for Collecting Data Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We've written a series of how-to guides explaining how we collected the data. Most of the techniques are within the ability of the moderately experienced programmer. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites – so you know what you're asking for if you end up hiring someone to do the technical work for you. The tools: with the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source. Google Refine (formerly known as Freebase Gridworks) – a sophisticated application that makes data cleaning a snap. Ruby – the programming language we use the most at ProPublica.

How to make your infographics accessible and SEO friendly at the same time Infographics are everywhere. Some good - some bad. But most creators don't stop to think how to make sure search engines can understand their infographic - or how people who can't see pictures can consume them (maybe because they rely on screen readers or have chosen not to download images to their mobile phone). The trick to making infographics accessible and SEO friendly is to ensure: they're chopped into relevant sections (i.e. not one big image); text is text (you should be able to select it with a mouse); and if anything has to be shown as an image, you set appropriate ALT text (the flipside of this is that, if the image doesn't add any information, you DON'T set ALT text - I'll explain this below). Making an infographic accessible There are lots of infographics out there. Also I should point out that I'm a crap HTML coder, so if anyone can improve on this, do let me know. Separate images and text As it stands, that bottom left bit is just part of an enormous image. What now? OK, you're thinking.

What could a journalist do with ScraperWiki? A quick guide | Scraperwiki Data Blog For non-programmers, a first look at ScraperWiki's code could be a bit scary, but we want journalists and researchers to make use of the site, so we've set up a variety of initiatives to do that. Firstly, we're setting up a number of Hacks and Hacker Days around the UK, with Liverpool as our first stop outside of London. You can follow this blog or visit our eventbrite page to find out more details. Secondly, our programmers are teaching ScraperWiki workshops and classes around the UK. Anna Powell-Smith took ScraperWiki to the Midlands, and taught Paul Bradshaw's MA students at Birmingham City University the basics. Julian Todd ran a 'Scraping 101' session at the Centre for Investigative Journalism summer school last weekend. You can see his slides here at this link. Julian explained just why ScraperWiki is useful, running through your options for web scraping; option 3 is where ScraperWiki, a place for sharing scrapers, comes in. (Some more general points from the session can be read here)
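For a flavour of what a scraper on the site looked like, here is a minimal sketch, assuming the classic scraperwiki Python library the site provided (scraperwiki.scrape to fetch a page, scraperwiki.sqlite.save to store rows in the built-in datastore). The URL, table layout and field names are invented for illustration.

```python
# Minimal sketch of a classic ScraperWiki scraper (Python 2).
# The URL and the name/amount fields are placeholders.
import scraperwiki
from BeautifulSoup import BeautifulSoup

html = scraperwiki.scrape('http://example.com/register.html')
soup = BeautifulSoup(html)

for row in soup.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) < 2:
        continue  # skip header or malformed rows
    record = {
        'name': ''.join(cells[0].findAll(text=True)).strip(),
        'amount': ''.join(cells[1].findAll(text=True)).strip(),
    }
    # Rows land in ScraperWiki's datastore; 'name' is the unique key,
    # so re-running updates existing rows rather than duplicating them.
    scraperwiki.sqlite.save(unique_keys=['name'], data=record)
```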

7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don't Know – Fell in Love with Data (In my last post I introduced the idea of regularly posting research material in this blog as a way to bridge the gap between researchers and practitioners. Some people kindly replied to my call for feedback and the general feeling seems to be: "cool, go on! rock it! we need it!". Ok, thanks guys, your encouragement is very much needed. I love you all.) Even if I am definitely not a veteran of infovis research (far from it), I started reading my first papers around the year 2000 and since then I've never stopped. The papers I have selected come from the very early days of infovis, are foundational, are cited over and over, and I like them a lot. Of course this doesn't mean these are the only ones you should read if you want to dig into this matter. Advice: in order to really appreciate them you have to remember that they were all written during the '90s (some even in the '80s!). Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. Please don't tell me you don't know this one!

An introduction to data scraping with Scraperwiki Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki's own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms. It's pretty good. Why screen scraping is useful for journalists Screen scraping can cover a range of things, but for journalists it, initially, boils down to a few things: grabbing information from somewhere; storing it somewhere that you can get to it later; and in a format that makes it easy (or easier) to analyse and interrogate. So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
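As a concrete (and invented) illustration of that "scrape it, then put it in a spreadsheet" workflow: the sketch below fetches a placeholder page containing a three-column table and writes it out as a CSV file that Excel or any spreadsheet program can open. The URL and the date/category/location columns are assumptions, not from the original post.

```python
# Minimal sketch (Python 2): scrape a table into a CSV "spreadsheet".
# The URL and column layout are placeholders, not a real police site.
import csv
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://example.com/incidents.html').read()
soup = BeautifulSoup(html)

with open('incidents.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'category', 'location'])  # header row
    for row in soup.findAll('tr'):
        cells = row.findAll('td')
        if len(cells) == 3:  # keep only well-formed data rows
            writer.writerow([''.join(c.findAll(text=True))
                             .strip().encode('utf-8') for c in cells])
```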

dataist blog: An inspiring case for journalists learning to code | Dan Nguyen pronounced fast is danwin About a year ago I threw up a long, rambling guide hoping to teach non-programming journalists some practical code. Looking back at it, it seems inadequate. Actually, I misspoke, I haven't looked back at it because I'm sure I'd just spend the next few hours cringing. For example, what a dumb idea it was to put everything from "What is HTML" to actual Ruby scraping code all in a gigantic, badly formatted post. The series of articles has gotten a fair number of hits, but I don't know how many people were able to stumble through it. [Image: mapping of the Ratata blogging network by Jens Finnäs of dataist.wordpress.com] I hope other non-coders who are still intimidated by the thought of learning programming are inspired by Finnäs's example. ProPublica's Dollars for Docs project originated in part from this Pfizer-scraping lesson I added on to my programming tutorial: I needed a timely example of public data that wasn't as useful as it should be. In fact, just knowing to avoid taking notes like this:

Needlebase

Telling Better Stories by Designing Custom Maps Using TileMill Plotting information — say survey data in Pakistan's Federally Administered Tribal Areas or election results in Afghanistan — on any kind of map adds critical geo-context to the data. These maps quickly become more powerful when you start adding custom overlays, showing data like where different ethnic groups live, high incidence of corruption, or more complex visuals like the number of deaths per drone strike in Pakistan and which U.S. president ordered it. What is really amazing is how accessible it now is for people to make custom maps that tell more complex stories with data. Specifically, tools like Google Maps, OpenLayers, and Polymaps have made basic web mapping ubiquitous by making it simple to drop a map into a website, and their APIs open the door for everyone to customize maps by adding custom layers. The trick now is to radically reduce the barrier to entry for making these overlays and custom base maps.
