Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I'm going in…), and in part by Scott Leslie's compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily "PLE"), I started having a tinker with using Google spreadsheets for data table screenscraping. So here's a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML("","table",N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page and the target table element both need to be in double quotes. The number N identifies the N'th table in the page (counting starts at 0) as the target table for data scraping. Grab the URL, fire up a new Google spreadsheet, and start to enter the formula "=importHTML" into one of the cells.
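For example, a completed formula might look like this (the Wikipedia URL here is purely illustrative; note too that newer versions of Google Sheets number tables from 1 rather than 0):

    =importHTML("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 0)

Once the formula is in place, the table's rows and columns populate the spreadsheet, and from there the data can be re-published or exported, for example as CSV, for use elsewhere.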
Masterclass 20: Getting started in data journalism

If you are impatient to get started and just want to quickly do some data journalism, skip straight ahead. If you want to find out what data journalism is, and what it's for, before you get stuck in, then read on, or turn to the video or audio files.

Are you confused about what data journalism is, how you do it, and what its purpose is? If so, join the club. There is a mystique surrounding data journalism; it's almost as if it's a dark art and you have to be a wizard to practise it. A very few people are brilliant at it, a number have dabbled in it, loads of journalists think they probably ought to find out about it, but most fear they probably won't be able to master it. All this throws up a smoke screen about the subject that I hope to dispel in this masterclass.

What data journalism is

I aim to show what data journalism is, what it can do, and how to do it.
An Introduction to Compassionate Screen Scraping

Screen scraping is the art of programmatically extracting data from websites. If you think it's useful: it is. If you think it's difficult: it isn't. We're going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible.

Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screen scraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Now, armed with those guidelines, let's get started screen scraping.

Setup Libraries

First we need to install the httplib2 and BeautifulSoup libraries.

    sudo easy_install BeautifulSoup
    sudo easy_install httplib2

If you don't have easy_install installed, then you'll need to download them from their project pages at httplib2 and BeautifulSoup.

Choosing a Scraping Target
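To make that concrete, here is a minimal fetch-and-parse sketch in the Python 2 / BeautifulSoup 3 idiom the tutorial uses (the URL is illustrative):

    import httplib2
    from BeautifulSoup import BeautifulSoup

    # a disk cache means repeat requests are served from disk ("cache fervently")
    http = httplib2.Http(".cache")
    response, content = http.request("http://example.com/", "GET")

    # parse the page and list every link it contains
    soup = BeautifulSoup(content)
    for link in soup.findAll("a"):
        print link.get("href")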
What could a journalist do with ScraperWiki? A quick guide | Scraperwiki Data Blog

For non-programmers, a first look at ScraperWiki's code could be a bit scary, but we want journalists and researchers to make use of the site, so we've set up a variety of initiatives to do that. Firstly, we're setting up a number of Hacks and Hackers Days around the UK, with Liverpool as our first stop outside of London. You can follow this blog or visit our Eventbrite page to find out more details. Secondly, our programmers are teaching ScraperWiki workshops and classes around the UK. Anna Powell-Smith took ScraperWiki to the Midlands, and taught Paul Bradshaw's MA students at Birmingham City University the basics. Julian Todd ran a 'Scraping 101' session at the Centre for Investigative Journalism summer school last weekend. You can see his slides here. Julian explained just why ScraperWiki is useful, running through your options for web scraping; the third option is where ScraperWiki, a place for sharing scrapers, comes in. (Some more general points from the session can be read here.)
Beautiful Soup: We called him Tortoise because he taught us.

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. Valuable data that was once locked up in poorly-designed websites is now within your reach. Interested?

Getting and giving support

If you have questions, send them to the discussion group. If you use Beautiful Soup as part of your work, please consider a Tidelift subscription.

Download Beautiful Soup

The current release is Beautiful Soup 4.9.1 (May 17, 2020).
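For a flavour of what that looks like in practice, a minimal Beautiful Soup 4 sketch (the HTML snippet is invented for the example):

    from bs4 import BeautifulSoup

    html = """<html><body>
    <p class="story">Once upon a time there were
    <a href="http://example.com/elsie" id="link1">three</a>
    <a href="http://example.com/lacie" id="link2">little</a> sisters.</p>
    </body></html>"""

    soup = BeautifulSoup(html, "html.parser")
    print(soup.p["class"])              # ['story']
    # navigate the parse tree: every <a> tag, its href and its text
    for a in soup.find_all("a"):
        print(a.get("href"), a.get_text())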
An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki's own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms. It's pretty good.

Why screen scraping is useful for journalists

Screen scraping can cover a range of things, but for journalists it initially boils down to a few things:

- gathering information from somewhere
- storing it somewhere that you can get to it later
- and in a format that makes it easy (or easier) to analyse and interrogate

So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
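By way of illustration, a classic ScraperWiki scraper followed exactly that pattern: fetch the page, pull out the rows, and save them to the site's datastore. A minimal sketch, where the police authority URL, the table layout and the field names are all invented for the example:

    import scraperwiki
    from BeautifulSoup import BeautifulSoup

    # gather: fetch the page (hypothetical URL)
    html = scraperwiki.scrape("http://www.example-police.gov.uk/incidents.html")
    soup = BeautifulSoup(html)

    # walk the rows of the first table, skipping the header row
    for row in soup.find("table").findAll("tr")[1:]:
        cells = ["".join(td.findAll(text=True)).strip() for td in row.findAll("td")]
        record = {"date": cells[0], "offence": cells[1], "area": cells[2]}
        # store: save each row to ScraperWiki's datastore for later interrogation
        scraperwiki.sqlite.save(unique_keys=["date", "offence"], data=record)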
Python Programming Language – Official Website

Needlebase

Creating a Scraper for Multiple URLs Using Regular Expressions | OutWit Technologies Blog

Important note: the tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application, which are accessible from the Help menu. You should run these to discover the Hub. NOTE: This tutorial was created using version 0.8.2.

In this example we'll redo the scraper from the previous lesson using regular expressions. Recap: for complex web pages or specific needs, when the automatic data extraction functions (table, list, guess) don't provide you with exactly what you are looking for, you can extract data manually by creating your own scraper. First, launch OutWit Hub, then open the page in the Page view. In the Page view you will see a list of leading firms by activity. Traditionally, you'd have to click on each link, then copy and paste the information into an Excel spreadsheet, but with the scraper function we're going to save a lot of time and energy.
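OutWit's manual scrapers work by describing the text patterns that surround the data you want. As a rough analogue for readers following along in code rather than in the Hub, here is a hypothetical sketch of the same idea using Python's re module (the HTML fragment and the pattern are invented):

    import re

    # a made-up fragment of the kind of listing page described above
    html = """
    <li><a href="/firm/101">Acme Widgets</a> - Manufacturing</li>
    <li><a href="/firm/102">Bolt &amp; Co</a> - Consulting</li>
    """

    # capture the link, the firm name and the activity from each list item
    pattern = re.compile(r'<a href="(/firm/\d+)">([^<]+)</a>\s*-\s*([^<\n]+)')
    for url, name, activity in pattern.findall(html):
        print(url, name.strip(), activity.strip())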
Telling Better Stories by Designing Custom Maps Using TileMill

Plotting information — say survey data in Pakistan's Federally Administered Tribal Areas or election results in Afghanistan — on any kind of map adds critical geo-context to the data. These maps quickly become more powerful when you start adding custom overlays, showing data like where different ethnic groups live, areas with a high incidence of corruption, or more complex visuals like the number of deaths per drone strike in Pakistan and which U.S. president ordered it. What is really amazing is how accessible it now is for people to make custom maps and tell more complex stories with data. Specifically, tools like Google Maps, OpenLayers, and Polymaps have made basic web mapping ubiquitous by making it simple to drop a map into a website, and their APIs open the door for everyone to customize maps by adding custom layers. The trick now is to radically reduce the barrier to entry for making these overlays and custom base maps.
Branded journalists battle newsroom regulations

With social media a big part of newsroom life, individual journalists often find their personal brands attractive selling points for future employers. But lately many of these same social media superstars are questioning whether newsrooms are truly ready for the branded journalist.

In late January, Matthew Keys, Deputy Social Media Editor at Reuters, wrote a blog post in which he criticized his former employer (ABC affiliate KGO-TV in San Francisco) for taking issue with his use of social media. Keys says his supervisors questioned the language, tone and frequency of his tweets, as well as his judgment when he retweeted his competitors. Not long after Keys' post went live, CNN's Roland Martin was suspended for comments he tweeted during the Super Bowl. Then came the news that Britain's Sky News had revised its social media policies to forbid, among other things, retweeting Sky competitors. NPR's media correspondent David Folkenflik has a thriving social media presence.