Coding for Journalists 101 : A four-part series
Photo by Nico Cavallotto on Flickr. Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles, with far more practical and up-to-date examples and projects than what you’ll find here. I’m only keeping this old walkthrough up as a historical reference. So check it out: The Bastards Book of Ruby. -Dan. Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, describing what I did for our Dollars for Docs project. So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals that he or she could write a web scraper to collect data from public websites. As the tutorials are aimed at people who aren’t experienced programmers, the code is pretty verbose, pedantic and, in some cases, a little inefficient.

Data Scraping Wikipedia with Google Spreadsheets. Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90-minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar-duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data-table screenscraping. So here’s a quick summary of (part of) what I found I could do. The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 1) as the target table for data scraping. Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells.
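For example, a filled-in version of that formula might look like the line below; the Wikipedia URL is only a placeholder for whichever page you grabbed, and the final number picks which table on the page to import (1 selects the first table in current Google Sheets):

=importHTML("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 1)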

Masterclass 20: Getting started in data journalism. If you want to find out what data journalism is, and what it’s for, before you get stuck in, then read on, or click on the video or audio files. Are you confused about what data journalism is, how you do it, and what its purpose is? If so, join the club. There is a mystique surrounding data journalism; it’s almost as if it’s a dark art and you have to be a wizard to practise it. A very few people are brilliant at it, a number have dabbled in it, loads of journalists think they probably ought to find out about it, but most fear they probably won’t be able to master it. All this throws up a smoke screen around the subject that I hope to dispel in this masterclass. What data journalism is: I aim to show what data journalism is, what it can do, and how to do it.

Network Graph - Fusion Tables Help. Current limitations include: the visualization will only show relationships from the first 100,000 rows in a table (a filter can include rows from 100,001 or beyond, but the graph will still not display them), and Internet Explorer 8 and below are not supported. Each row of a table represents one relationship in the graph; in the help page’s sample data, the network graph shows each row as a line connecting a person and a dog. To create a Network Graph in the New look: [+] > Add chart, then click the Network Graph button. To create a Network Graph in Classic: Experiment > Network Graph. By default, the first two text columns will be selected as the source of nodes: Node column 1 and Node column 2. To adjust the Network Graph’s display, select a number column to act as a Weight factor for line length. When interacting with the Network Graph, “camera” zoom means nodes become bigger but not more or less numerous. Good to know: multiple relationships between two nodes are summed into a thicker line. Try it yourself!
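As a rough illustration of the table shape this expects (the column names and rows below are invented, not taken from the help page), each row holds two node values plus an optional number column that can serve as the weight:

Person,Dog,Walks_per_week
Alice,Rex,3
Alice,Fido,1
Bob,Rex,2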

An Introduction to Compassionate Screen Scraping. Screen scraping is the art of programmatically extracting data from websites. If you think it’s useful: it is. If you think it’s difficult: it isn’t. We’re going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible. Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screenscraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Now, armed with those guidelines, let’s get started screen scraping. Setup: Libraries. First we need to install the httplib2 and BeautifulSoup libraries: sudo easy_install BeautifulSoup and sudo easy_install httplib2. If you don’t have easy_install installed, then you’ll need to download the libraries from their project pages at httplib2 and BeautifulSoup.
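A minimal sketch of the caching-and-parsing flow the article describes, using those two libraries; the URL and User-Agent string are placeholders, and a modern Beautiful Soup 4 import is assumed rather than the older BeautifulSoup module:

import httplib2
from bs4 import BeautifulSoup

# Responses are cached in the ".cache" directory, so repeated runs
# don't hammer the site with identical requests.
http = httplib2.Http(".cache")
headers = {"User-Agent": "Mozilla/5.0 (compatible; polite-research-scraper)"}

response, content = http.request("http://example.com/some-page", "GET", headers=headers)
soup = BeautifulSoup(content, "html.parser")

# Pull out whatever you actually need, e.g. every link on the page.
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))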

What could a journalist do with ScraperWiki? A quick guide | Scraperwiki Data Blog. For non-programmers, a first look at ScraperWiki’s code could be a bit scary, but we want journalists and researchers to make use of the site, so we’ve set up a variety of initiatives to do that. Firstly, we’re setting up a number of Hacks and Hackers Days around the UK, with Liverpool as our first stop outside London. You can follow this blog or visit our Eventbrite page to find out more details. Secondly, our programmers are teaching ScraperWiki workshops and classes around the UK. Anna Powell-Smith took ScraperWiki to the Midlands and taught Paul Bradshaw’s MA students at Birmingham City University the basics. Julian Todd ran a ‘Scraping 101’ session at the Centre for Investigative Journalism summer school last weekend. You can see his slides at this link. Julian explained just why ScraperWiki is useful by running through your options for webscraping; number 3 on his list is where ScraperWiki, a place for sharing scrapers, comes in. (Some more general points from the session can be read here.)

guides/data-bulletproofing.md at master · propublica/guides

Beautiful Soup: We called him Tortoise because he taught us. You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Beautiful Soup is a Python library designed for quick-turnaround projects like screen-scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. Beautiful Soup parses anything you give it and does the tree traversal stuff for you. Valuable data that was once locked up in poorly-designed websites is now within your reach. Interested? If you have questions, send them to the discussion group. If you use Beautiful Soup as part of your work, please consider a Tidelift subscription. The current release is Beautiful Soup 4.9.1 (May 17, 2020).
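A minimal sketch of what that navigating-and-searching looks like in practice; the HTML snippet and tag names below are invented purely for illustration:

from bs4 import BeautifulSoup

html = "<html><body><h1>Results</h1><table><tr><td>Smith</td><td>42</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())             # dotted access walks the tree: "Results"
row = soup.find("tr")                 # find() returns the first matching tag
cells = [td.get_text() for td in row.find_all("td")]
print(cells)                          # ['Smith', '42']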

An introduction to data scraping with Scraperwiki. Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is, through the functionality of Scraperwiki, in journalistic terms. It’s pretty good. Why screen scraping is useful for journalists: screen scraping can cover a range of things, but for journalists it initially boils down to a few things: pulling information from somewhere; storing it somewhere that you can get to it later; and keeping it in a form that makes it easy (or easier) to analyse and interrogate. So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on, when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
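A hedged sketch of that scrape-it-into-a-spreadsheet idea in plain Python (outside ScraperWiki itself); the URL and the assumption that the data sits in an HTML table are hypothetical:

import csv
import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http(".cache")
response, content = http.request("http://example.gov/weekly-incidents", "GET")
soup = BeautifulSoup(content, "html.parser")

# Write every table row out to a CSV file you can sort, filter and total up later.
with open("incidents.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in soup.find_all("tr"):
        writer.writerow([cell.get_text(strip=True) for cell in row.find_all(["td", "th"])])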

Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I. A couple of weeks ago I came across Gephi, a desktop application for visualising networks. And quite by chance, a day or two later, I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do :-) So, after a few false starts, here’s what I’ve learned so far… First up, we need to get some graph data – netvizz – facebook to gephi suggests that the netvizz Facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can’t be too careful ;-) Once Gephi is launched (and updated, if it’s a new download – you’ll see an updates prompt in the status bar along the bottom of the Gephi window, right-hand side), Open… the network file you downloaded. You can also generate views of the graph that show information about the network.
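For reference, the network file netvizz hands over is a plain-text graph file that Gephi opens directly; assuming the GDF export netvizz used at the time, a made-up miniature example looks roughly like this (the names are invented and the exact columns netvizz writes will differ):

nodedef>name VARCHAR,label VARCHAR
u1,Alice
u2,Bob
u3,Carol
edgedef>node1 VARCHAR,node2 VARCHAR
u1,u2
u2,u3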

Python Programming Language – Official Website

Needlebase
