Scraping for Journalism: A Guide for Collecting Data

Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data.
Most of the techniques are within the ability of the moderately experienced programmer. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites, so you know what you’re asking for if you end up hiring someone to do the technical work for you.
The tools
With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.
Google Refine (formerly known as Freebase Gridworks) – A sophisticated application that makes data cleaning a snap.
Ruby – The programming language we use the most at ProPublica.

Data Extraction and Web Scraping
A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract this data for you and either re-use the data or store it in a file or database. iMacros can write extracted data to standard text files, including the comma separated value (.csv) format, readable by spreadsheet processing packages.
The Extract command
Data extraction is specified by an EXTRACT parameter in the TAG command.
TAG POS=1 TYPE=SPAN ATTR=CLASS:bdytxt&&TXT:* EXTRACT=HTM
This means that the syntax of the command is now the same as for the TAG command, with the type of extraction specified by the additional EXTRACT parameter.
Creation of Extraction Tags
Extraction Wizard
The Extraction Wizard can be used to automatically generate and test extractions. To define an EXTRACT command proceed as follows: Example:

Scraping a site in Ruby for dummies (or almost)
# encoding: UTF-8
require 'open-uri'
require 'nokogiri'
require 'csv'
# Cleans useless characters out of a string
def clean str
  str.strip.gsub("\n", ' ').gsub(' ', ' ').gsub(' ', ' ').gsub(' ', ' ').gsub(' ', ' ').gsub(' ', ' ').gsub(' ', ' ')
end
# the types of decisions
# we will write to this file
CSV.open("conseil_constitutionel.csv", "w") do |csv|
  # the header row
  csv << ["Année", "Numéro", "Date", "N°", "Type", "Intitulé", "Décision", "URL"]
  # the entry point
  main_url = "
  # on this page we collect all the links inside the #articlesArchives div, which correspond to the pages listing the decisions
  Nokogiri::HTML(open(main_url)).search('#articlesArchives a').each do |a|
    # the link's text corresponds to the year
    year = a.inner_text
    Nokogiri::XML(open(url_decision), nil, 'UTF-8').search('#articles li').each do |decision|
      if index_id
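The whitespace-cleaning helper in the excerpt chains many gsub calls. Here is a minimal sketch of the same idea, assuming the goal is simply to trim the string and collapse every run of whitespace (newlines, tabs, non-breaking spaces) into a single space. The method names clean_text and csv_line are hypothetical and do not come from the original script.

```ruby
require 'csv'

# Collapse every run of whitespace (spaces, tabs, newlines, non-breaking
# spaces) into a single space, then trim both ends.
# Assumption: this is what the chained gsub calls in the excerpt aim for.
def clean_text(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

# Build one CSV line from raw cell values, cleaning each cell first.
# (Hypothetical helper for illustration.)
def csv_line(cells)
  CSV.generate_line(cells.map { |cell| clean_text(cell) })
end
```

Cleaning each cell before it is written keeps stray line breaks from scraped HTML out of the spreadsheet rows.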

Data journalism: a source of meaning and of profits Faced with the avalanche of information, data-mining techniques make it possible to extract meaning from databases. Trust becomes the scarce resource that creates value. And the media can seize it. This post draws on a line of thinking begun with Mirko Lorenz and Geoff McGhee in an article titled Media Companies Must Become Trusted Data Hubs and presented at the re:publica XI conference. Every day, we produce two or three exabytes of data, that is, 2 to 3 million terabytes. If we want to condense all the information produced into something digestible for the end user, we have to summarize by a factor of 100 billion. To make sense of this overabundance of content, information professionals must adopt new techniques. Once you are equipped with the right tools, making masses of data speak becomes possible. All information is data Some initiatives are moving in this direction. Liquid media

Chapter 1. Using Google Refine to Clean Messy Data
Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management.
Other reasons why you should try Google Refine:
It’s free.
It works in any browser and uses a point-and-click interface similar to Google Docs.
Despite the Google moniker, it works offline. There’s no requirement to send anything across the Internet.
There’s a host of convenient features, such as an undo function, and a way to visualize your data’s characteristics.

How to use LinkedIn for data miners
After the article How to use twitter for data miners, let me offer some advice on using LinkedIn. First, you may already know that your LinkedIn account can be linked to display your tweets (see this link). Continue by searching for other people with the same interests (use the same keywords as above). The next step is to participate in data mining groups, such as:
ACM SIGKDD
Advanced Business Analytics, Data Mining and Predictive Modeling
AnalyticBridge
Business Analytics
CRISP-DM
Customers DNA
Data Miners
Data Mining Technology
Data Mining, Statistics, and Data Visualization
Machine Learning Connection
Open Source Data Mining
SmartData Collective

DATA: Without adding context, a journalist with data can be dangerous If you believe the predictions, 2011 will be the year when journalists have more access to data than ever before. Of course, much of the data will also be accessible to the public in general but I suspect more people will be exposed to data via journalism than will actively seek it themselves. And with that comes a responsibility to make sure that journalists present the full picture with a set of data. In other words, add some context. The old phrase about lies, lies and statistics can be true if one set of data is taken in isolation. Paul Bradshaw touched on this when looking at a story in November which ‘revealed’ that Birmingham had more CCTV cameras than any other council area. So the challenge for 2011 isn’t just making use of all the data that’s available, it’s making use of it responsibly, linking data together to come up with a true picture. If journalists don’t do this, then there will be people who do it for them, post publication. ... This data took 10 minutes to compile.

Chapter 2: Reading Data from Flash Sites
Flash applications often disallow the direct copying of data from them. But we can instead use the raw data files sent to the web browser. Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying about how the data is visually displayed. For example, the data displayed on this Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique that you should use when first examining a database-backed website.
Background
In September 2008, drug company Cephalon pleaded guilty to a misdemeanor charge and settled a civil lawsuit involving allegations of fraudulent marketing of its drugs. Cephalon's report is not downloadable and the site disables the mouse’s right-click function, which typically brings up a pop-up menu with the option to save the webpage or inspect its source code.
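One way to narrow down browser traffic is to copy the list of request URLs from the browser's network inspector and keep only the ones that look like raw data files. The following is a rough sketch of that filtering step, not the tutorial's own code. The method name data_file_candidates, the extension list, and the example URLs are all assumptions made for illustration.

```ruby
require 'uri'

# File extensions that often indicate raw data rather than page assets.
# Assumption: this list is illustrative, not exhaustive.
DATA_EXTENSIONS = %w[.txt .xml .json .csv].freeze

# Return the URLs whose path ends in one of the data-like extensions.
# Query strings are ignored, so "data.txt?v=2" still counts.
def data_file_candidates(urls)
  urls.select do |url|
    path = URI.parse(url).path.to_s
    DATA_EXTENSIONS.include?(File.extname(path).downcase)
  end
end

# Hypothetical list of requests copied from a network inspector.
requests = [
  'https://www.example.com/map.swf',
  'https://www.example.com/data/payments.txt?v=2',
  'https://www.example.com/style.css'
]
candidates = data_file_candidates(requests)
```

Each candidate can then be opened directly in the browser to check whether it holds the data the Flash application displays.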

Datamining Twitter On its own, Twitter builds an image for companies; very few are aware of this fact. When a big surprise happens, it is too late: a corporation suddenly sees a facet of its business — most often a looming or developing crisis — flare up on Twitter. As always when a corporation is involved, there is money to be made by converting the problem into an opportunity: Social network intelligence is poised to become a big business. In theory, when it comes to assessing the social media presence of a brand, Facebook is the place to go. But as brands flock to the dominant social network, the noise becomes overwhelming and the signal — what people really say about the brand — becomes hard to extract. By comparison, Twitter more swiftly reflects the mood of users of a product or service. Datamining Twitter is not trivial. Companies such as DataSift (launched last month) exploit the Twitter fire hose by relying on the 40-plus metadata included in a post. …is a rich trove of data.

The Necessity of Data Journalism in the New Digital Community This is the second post in a series from Nicholas White, the co-founder and CEO of The Daily Dot. It used to be, to be a good reporter, all you had to do was get drunk with the right people. Sure, it helped if you could string a few words together, but what was really important was that when news broke, you could get the right person on the phone and get the skinny. Or when something scandalous was going down somewhere, someone would pick up the phone and call you. Increasingly today, in selecting and training reporters, the industry seems to focus on the stringing-words-together part. That’s not how we’re building our newsroom at The Daily Dot. One: Our very first newsroom hire, after our executive editor, was Grant Robertson, who’s not only a reporter and an editor, but also a programmer. We found it necessary to push early in this direction because of our unique coverage area and we’re in the fortunate position of being able to build our newsroom from scratch. How do we report on that?

Chapter 4: Scraping Data from HTML Web-scraping is essentially the task of finding out what input a website expects and understanding the format of its response. For example, takes a user's zip code as input before returning a page showing federal stimulus contracts and grants in the area. This tutorial will teach you how to identify the inputs for a website and how to design a program that automatically sends requests and downloads the resulting web pages. Pfizer disclosed its doctor payments in March as part of a $2.3 billion settlement - the largest health care fraud settlement in U.S. history - of allegations that it illegally promoted its drugs for unapproved uses. Of the disclosing companies so far, Pfizer's disclosures are the most detailed and its site is well-designed for users looking up individual doctors. So we will write a scraper to download Pfizer's list and record the data in spreadsheet form. You may also find Firefox's Firebug plugin useful for inspecting the source HTML. Data Structure
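The request-and-download loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the endpoint https://www.example.com/search, the query parameter name zip, and the method names are hypothetical, and a real site's inputs have to be found by inspecting its pages first.

```ruby
require 'net/http'
require 'uri'

# Hypothetical search endpoint; substitute the site you are scraping.
BASE_URL = 'https://www.example.com/search'

# Build the request URL for one input value (here, a zip code).
# Assumption: the site reads the input from a query parameter named "zip".
def search_url(zip_code)
  uri = URI.parse(BASE_URL)
  uri.query = URI.encode_www_form(zip: zip_code)
  uri.to_s
end

# Download one page and return its HTML body as a string.
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Request each input in turn and save each response to its own file.
def download_all(zip_codes, delay: 1)
  zip_codes.each do |zip|
    File.write("page_#{zip}.html", fetch_page(search_url(zip)))
    sleep delay # pause between requests to go easy on the server
  end
end
```

Saving the raw pages to disk first, and parsing them in a separate step, means a parsing mistake does not force you to download everything again.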

A special report on managing information: Data, data everywhere WHEN the Sloan Digital Sky Survey started work in 2000, its telescope in New Mexico collected more data in its first few weeks than had been amassed in the entire history of astronomy. Now, a decade later, its archive contains a whopping 140 terabytes of information. A successor, the Large Synoptic Survey Telescope, due to come on stream in Chile in 2016, will acquire that quantity of data every five days. Such astronomical amounts of information can be found closer to Earth too. Wal-Mart, a retail giant, handles more than 1m customer transactions every hour, feeding databases estimated at more than 2.5 petabytes—the equivalent of 167 times the books in America's Library of Congress (see article for an explanation of how data are quantified). Facebook, a social-networking website, is home to 40 billion photos. All these examples tell the same story: that the world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. Dross into gold

Coding for Journalists 104: Pfizer’s Doctor Payments; Making a Better List | Dan Nguyen pronounced fast is danwin Update (12/30): So about an eon later, I’ve updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state. Update (4/28): Replaced the code and result files. Still haven’t written out a thorough explainer of what’s going on here. Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities. So the world’s largest drug maker, Pfizer, decided to tell everyone which doctors they’ve been giving money to to speak and consult on its behalf in the latter half of 2009. From the NYT: Pfizer, the world’s largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. So, not an entirely altruistic release of information. Not bad at first glance. The Code The Results

Tim Berners-Lee: “Governments should encourage open data” From the opening of public data to the future of the network, by way of HTML 5, a look at what the web is becoming with one of its principal inventors. On the occasion of an annual W3C conference held on November 2 in Lyon, the MagIT editorial team met Tim Berners-Lee, the father of the Web and one of the heads of the consortium. On the agenda: the Semantic Web, open data, HTML 5 and the W3 foundation. Is the Semantic Web evolving at the pace you had hoped? Tim Berners-Lee: I didn't have forecasts, strictly speaking. The semantic approach is also emerging on the desktop, with Nepomuk (a semantic desktop project making a first appearance in Mandriva 2010), perhaps more quickly on the Web. How mature are the tools in place today? TBL: When you consider the Semantic Web, developing new tools is always fascinating; everything connects and feeds itself. Tim Berners-Lee, prophet of open data?

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully. | Dan Nguyen pronounced fast is danwin
UPDATE (12/1/2011): Ever since writing this guide, I’ve wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby. I’ve since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I’m a better coder now than I was a year and a half ago when I first wrote this guide.
Who this post is for
You’re a journalist who knows almost nothing about computers beyond using them to connect to the Internets, email, and cheat on Facebook scrabble. Anyone who has taken a semester of computer science will scoff at how I’ve simplified even the basic fundamentals of programming…and they’d be right…but my goal is just to get you into the basics to write some useful code immediately. Thankfully, coding is something that provides immediate success and failure.
The roadmap The task Tags Here is a h4 headline Strings