
Chapter 1. Using Google Refine to Clean Messy Data

Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a "power tool for working with messy data," but it could very well be advertised as a "remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning." Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and for journalists well versed in Access and Excel, Refine can greatly reduce the time spent on the most tedious parts of data management. Other reasons why you should try Google Refine:
- It's free.
- It works in any browser and uses a point-and-click interface similar to Google Docs.
- Despite the Google moniker, it works offline.
Download and installation instructions for Refine are here. This tutorial covers the same ground as this screencast by Refine's developer David Huynh (the other two videos are here).
Starting a Project
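Refine itself is a point-and-click application, but the kind of cleanup it automates is easy to illustrate in code. Below is a minimal Python sketch, assuming a small list of invented company-name spellings, of the "key collision" idea behind Refine's clustering: near-duplicate values are grouped by a normalized fingerprint. It is an illustration of the technique, not Refine's actual implementation.

```python
# Sketch of fingerprint-style clustering, similar in spirit to Refine's
# key-collision clustering. The sample values below are invented.
from collections import defaultdict

def fingerprint(value):
    """Lowercase, strip punctuation, and sort tokens so that
    'Pfizer Inc.' and 'PFIZER, INC' collapse to the same key."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(sorted(set(cleaned.split())))

messy = ["Pfizer Inc.", "PFIZER, INC", "  pfizer inc ", "Cephalon Inc.", "cephalon, INC"]

clusters = defaultdict(list)
for name in messy:
    clusters[fingerprint(name)].append(name)

for key, variants in clusters.items():
    print(key, "->", variants)
```

In Refine you would review each cluster and pick a canonical spelling with a click; the fingerprint is only the grouping step.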

Chapter 2: Reading Data from Flash Sites
Adobe Flash applications often prevent you from copying data directly, but we can instead use the raw data files that are sent to the web browser. This tutorial will teach you how to find and examine the raw data files sent to your browser, without worrying about how the data is visually displayed. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique you should use when first examining a database-backed website.
Background
In September 2008, drug company Cephalon pleaded guilty to a misdemeanor charge and settled a civil lawsuit involving allegations of fraudulent marketing of its drugs. Cephalon's report is not downloadable, and the site disables the mouse's right-click function, which typically brings up a pop-up menu with the option to save the webpage or inspect its source code.
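Once browser-traffic inspection reveals the raw file a Flash map is drawing from, you can fetch that file directly and skip the visual layer entirely. Here is a minimal Python sketch of that step; the URL and the tab-delimited layout are placeholders, not the actual Recovery.gov feed.

```python
# Fetch the raw data file behind a Flash visualization directly.
# DATA_URL and the delimiter are hypothetical stand-ins.
import csv
import io
import requests

DATA_URL = "https://example.gov/flash-map/state_summaries.txt"  # placeholder URL

response = requests.get(DATA_URL, timeout=30)
response.raise_for_status()

# Many of these raw feeds are just delimited text; parse them like any CSV/TSV.
reader = csv.reader(io.StringIO(response.text), delimiter="\t")
for row in reader:
    print(row)
```

The point is the workflow: find the request in your browser's network traffic, then reproduce it in a script.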

Data Wrangler
UPDATE: The Stanford/Berkeley Wrangler research project is complete, and the software is no longer actively supported. Instead, we have started a commercial venture, Trifacta. For the most recent version of the tool, see the free Trifacta Wrangler.
Why wrangle? Too much time is spent manipulating data just to get analysis and visualization tools to read it.
DocumentCloud
In Search of Information Governance in the Enterprise
It's an understatement to say companies are drowning in digital information. Since the death of the floppy disk and the rise of networked computing, barriers to creating and sharing information have steadily come down. Combined with the increased digitization of paper-laden business processes, most companies find themselves struggling to harness the volume and diversity of information on their networks for business benefit. What's startling is just how little progress we've made in maximizing the value and minimizing the risks associated with the digital content and data we collect. Any discussion of information governance always brings me back to this depressing little anecdote: "Monday, September 8, 2008, is a day that the executives at United Airlines will remember." Of course, the incident led to a flurry of finger-pointing: was it the webmaster at the newspaper's site who let an old article get out? This is as true on the Internet as it is on the networks of the world's largest companies.

Chapter 4: Scraping Data from HTML
Web scraping is essentially the task of finding out what input a website expects and understanding the format of its response. For example, Recovery.gov takes a user's ZIP code as input before returning a page showing federal stimulus contracts and grants in the area. This tutorial will teach you how to identify the inputs a website expects and how to design a program that automatically sends requests and downloads the resulting web pages. Pfizer disclosed its doctor payments in March as part of a $2.3 billion settlement, the largest health care fraud settlement in U.S. history, of allegations that it illegally promoted its drugs for unapproved uses. Of the disclosing companies so far, Pfizer's disclosures are the most detailed, and its site is well designed for users looking up individual doctors. So we will write a scraper to download Pfizer's list and record the data in spreadsheet form. You may also find Firefox's Firebug plugin useful for inspecting the source HTML.
Data Structure
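As a rough sketch of the scraper this chapter describes, the Python snippet below requests a paginated listing, pulls table rows out of the HTML, and writes them to a CSV. The URL, query parameter, page count, and table structure are hypothetical; the real Pfizer site would need its own inspection (with Firebug or your browser's developer tools) to find the actual values.

```python
# Sketch of a listing scraper: request each page, parse the results table,
# write rows to a spreadsheet. URLs and selectors are invented placeholders.
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/doctor-payments"  # hypothetical listing page

with open("payments.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["doctor", "city", "category", "amount"])
    for page in range(1, 6):  # the real site would tell us how many pages exist
        html = requests.get(BASE_URL, params={"page": page}, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for row in soup.select("table.results tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:  # header rows use <th>, so they come back empty and are skipped
                writer.writerow(cells)
```

The two things you must discover for any site are the request it expects (here, a page parameter) and the structure of its response (here, rows of a results table); everything else is bookkeeping.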

Data management (Gestion des données)
An article from Wikipédia, the free encyclopedia.
Challenges of data management
First, there is the need to be able to anticipate. Yet, because of the very way many decision-support systems are designed, handling order-book and revenue data, the visibility offered by traditional management dashboards is often limited. It depends heavily on the economic sector and on the length of the product life cycle: long term (for nuclear power), short term (for fast-moving consumer goods), and so on. The ability to search for information with open-source search engines certainly offers considerable new possibilities, but at the same time it presents several difficulties: informational noise, and the risk of technological plundering, which raises legal questions of data protection.
General approach to data management
See also

MentorMob - Learn What You Want, Teach What You Love
A historic year for SAP France, with double-digit growth
By Bertrand Garé, January 23, 2014, 16:10
On the sidelines of the German software vendor's annual results announcement, Franck Cohen, President of SAP EMEA, looked back at the trends in Europe and in France. Despite a still-difficult economic situation, SAP is holding its own, and France stands out with a historic year thanks to large-scale contracts. While financial analysts were somewhat dismissive of SAP's results when they were announced, the details for the Europe region are not without interest and show overall growth of 9% across the region. Other standout areas were the cloud and the lines of business, which saw spectacular growth over the past year, again with significant contracts such as Burberry's and L'Oréal. Small and mid-sized companies also contributed substantially to the results.

Coding for Journalists 104: Pfizer's Doctor Payments; Making a Better List | Dan Nguyen pronounced fast is danwin
Update (12/30): So about an eon later, I've updated this by writing a guide for ProPublica. Heed that one. This one will remain in its obsolete state.
Update (4/28): Replaced the code and result files.
Update (4/19): After revisiting this script, I see that it fails to capture some of the payments to doctors associated with entities.
So the world's largest drug maker, Pfizer, decided to tell everyone which doctors it has been paying to speak and consult on its behalf in the latter half of 2009. From the NYT: Pfizer, the world's largest drug maker, said Wednesday that it paid about $20 million to 4,500 doctors and other medical professionals for consulting and speaking on its behalf in the last six months of 2009, its first public accounting of payments to the people who decide which drugs to recommend. So, not an entirely altruistic release of information. Not bad at first glance. Which doctor received the most?
The Code
These are the steps we'll take:
The Results
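The post's own code is not reproduced here, but the aggregation step behind "which doctor received the most?" is simple enough to sketch. Assuming the scraped payments have already been saved to a CSV with "doctor" and "amount" columns (invented names, not the actual file layout), a short Python script can total and rank them.

```python
# Total the scraped payments per doctor and print the top recipients.
# The file name and column names are assumptions for illustration.
import csv
from collections import defaultdict

totals = defaultdict(float)
with open("payments.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Strip "$" and thousands separators before converting to a number.
        amount = float(row["amount"].replace("$", "").replace(",", "") or 0)
        totals[row["doctor"]] += amount

for doctor, total in sorted(totals.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"{doctor}\t{total:,.2f}")
```

The wrinkle the 4/19 update mentions, payments attributed to entities rather than named doctors, is exactly the kind of case a simple per-name total would miss.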

Data cleansing
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data-dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry, and is performed at entry time rather than on batches of data. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data-cleansing solutions will clean data by cross-checking it with a validated data set.
Motivation
In the business world, incorrect data can be costly.
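The strict-versus-fuzzy distinction is easy to show in a few lines of Python. In this sketch the reference list, the five-digit ZIP pattern, and the cutoff are illustrative choices, not part of any particular cleansing product.

```python
# Strict vs. fuzzy validation, as described above.
# KNOWN_CITIES and the ZIP pattern are example data, not a real reference set.
import difflib
import re

KNOWN_CITIES = ["New York", "Newark", "New Haven", "Boston"]

def strict_zip(value):
    """Strict rule: reject anything that is not a 5-digit ZIP code."""
    return value if re.fullmatch(r"\d{5}", value) else None

def fuzzy_city(value):
    """Fuzzy rule: correct a value that closely matches a known record."""
    matches = difflib.get_close_matches(value, KNOWN_CITIES, n=1, cutoff=0.8)
    return matches[0] if matches else value

print(strict_zip("1002"))      # None -> rejected, as a strict rule would do at entry
print(fuzzy_city("New Yrok"))  # "New York" -> corrected against the known list
```

Strict rules fit validation at entry time; fuzzy correction is more typical of batch cleansing, where a human can review the proposed matches.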

SharedCopy
Choon Keat loves programming and has been doing it professionally for over 10 years, moving from C, Perl, and Java to Ruby and JavaScript. Choon Keat is curious about everything and is always thinking of ways to improve [his] life with better design and better software. Sometimes, he even blogs about these things. Choon Keat is practical, delivers simple solutions, and executes iteratively. He has been practicing that at a startup he founded, SharedCopy.
Professional summary
Choon Keat finds himself involved in startups most of the time. His hands-on experience ranges from (server-side) administering Linux boxes and writing web apps with SMSC, MMSC, and Jabber integrations, to (client-side) writing Eclipse plugins, JavaScript, and Flash ActionScript, to programming on various other devices such as phones and the TV. His recent work revolves around email, Ruby, Rails, and a lot of JavaScript.
Software passion
Choon Keat loves open source. Choon Keat understands that experiments are a large part of learning.
Contact

Data Quality Rules: A Management Primer for Taking Action — Data Quality Pro
Scenario B: Focused and Productive LLC
Company B has a formal data quality strategy and a series of contact-data governance policies in place to ensure correct management of contact data quality and its associated rules. Whenever new applications or coding requirements are suggested, all requests are routed via the Data Quality Officer so their team can coordinate a response and ensure the correct standards and policies are followed. Development and business teams don't need to store hundreds of policies; the central data quality team simply provides the latest and most accurate resource information. Over time, Company B has created a rich set of data quality rules in a hierarchical format that lets anyone quickly navigate to and copy the relevant rules for the contact data they wish to store, retrieve, process, or amend. All suppliers and third-party IT providers have to adhere to the data governance standards for data quality and contact-data application design.
Get some buy-in
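The article does not show what such a hierarchical rule set looks like, so the sketch below is purely hypothetical: a nested structure in Python where each contact-data field carries its own rules, and a team can navigate to the branch it needs and copy those rules into a new application's requirements. The categories, fields, and rules are invented for illustration.

```python
# Hypothetical hierarchical rule set for contact data.
# Every name and rule here is invented; none comes from the article.
CONTACT_DATA_RULES = {
    "contact": {
        "name": {
            "required": True,
            "rules": ["trim whitespace", "title-case", "no digits allowed"],
        },
        "email": {
            "required": True,
            "rules": ["lower-case", "must contain a single '@'", "domain must resolve"],
        },
        "postal_address": {
            "postcode": {
                "required": True,
                "rules": ["validate against national postcode reference file"],
            },
        },
    },
}

# A team designing a new application navigates to the branch it needs
# and copies the relevant rules instead of re-inventing them.
print(CONTACT_DATA_RULES["contact"]["email"]["rules"])
```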

Coding for Journalists 101: Go from knowing nothing to scraping Web pages. In an hour. Hopefully. | Dan Nguyen pronounced fast is danwin
UPDATE (12/1/2011): Ever since writing this guide, I've wanted to put together a site that is focused both on teaching the basics of programming and showing examples of practical code. I finally got around to making it: The Bastards Book of Ruby. I've since learned that trying to teach the fundamentals of programming in one blog post is completely dumb. Also, I hope I'm a better coder now than I was a year and a half ago when I first wrote this guide. Check it out and let me know what you think.
Who this post is for
His Girl Friday
You're a journalist who knows almost nothing about computers beyond using them to connect to the Internets, email, and cheat on Facebook scrabble. Anyone who has taken a semester of computer science will scoff at how I've simplified even the basic fundamentals of programming…and they'd be right…but my goal is just to get you into the basics to write some useful code immediately. Thankfully, coding is something that provides immediate success and failure.

DataHub Tool - Wiki
DataHub is a tool that speeds up the download/crawl, parse, load, and visualize steps of working with data. It achieves this by letting you divide each step into its own work folder. In each work folder you get sample files from which you can start coding. DataHub is for people who have found an interesting data source and want to download it, parse it, load it into a database, document it, and visualize it. DataHub speeds up the process by creating a folder for each of these actions.
Code Repository:
Sensitive, and possibly inaccurate, information may not be used against people in financial, political, employment, and health-care settings.
Engineering part: 1. Acquire, 2. Parse, 3. Filter, 4. Mine. Design part: 5. Represent, 6. Refine, 7. Interact.
The best way to get started with DataHub is to install it as follows: set up a virtualenv, which will keep the installation in a separate directory; download the source and untar it; then install it. Done.
crawl
