
Data Wrangler
UPDATE: The Stanford/Berkeley Wrangler research project is complete, and the software is no longer actively supported. Instead, we have started a commercial venture, Trifacta; for the most recent version of the tool, see the free Trifacta Wrangler.

Why wrangle? Too much time is spent manipulating data just to get analysis and visualization tools to read it. Wrangler is designed to accelerate this process: spend less time fighting with your data and more time learning from it.

Protovis

Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction. Protovis is free and open-source, provided under the BSD License. It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.
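As a sketch of that declarative style, a minimal Protovis bar chart defines each bar's geometry as a function of its datum. The data values and pixel sizes below are illustrative, and the snippet assumes protovis.js is loaded in the page:

```javascript
// Sample data and sizes are illustrative, not from the article.
var data = [1, 1.2, 1.7, 1.5, 0.7, 0.3];

// A mark's property can be a function of the datum: each bar's
// height is derived from its value, so the definition stays declarative.
var barHeight = function(d) { return d * 80; };

if (typeof pv !== "undefined") {   // guard: pv only exists where protovis.js is loaded
  var vis = new pv.Panel().width(150).height(150);
  vis.add(pv.Bar)
      .data(data)
      .bottom(0)
      .width(20)
      .height(barHeight)           // dynamic property encodes the data
      .left(function() { return this.index * 25; });
  vis.render();
}
```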

Chapter 1. Using Google Refine to Clean Messy Data

Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as a “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and for journalists well versed in Access and Excel, Refine can greatly reduce the time spent on the most tedious part of data management. Other reasons why you should try Google Refine:

- It’s free.
- It works in any browser and uses a point-and-click interface similar to Google Docs.
- Despite the Google moniker, it works offline; there’s no requirement to send anything across the Internet.
- There’s a host of convenient features, such as an undo function and a way to visualize your data’s characteristics.
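For a sense of the tedium Refine removes: a typical cell transform in Refine is a one-line GREL expression such as value.trim().toTitlecase(). Hand-coding the same cleanup in plain JavaScript looks like this (the helper name is ours, purely for illustration):

```javascript
// Hand-rolled equivalent of Refine's GREL expression value.trim().toTitlecase():
// strip surrounding whitespace, then capitalize the first letter of each word.
function toTitleCase(value) {
  return value.trim().toLowerCase()
      .replace(/\b\w/g, function(c) { return c.toUpperCase(); });
}
```

In Refine, the same transform is applied to a whole column with one click; the point of the tool is that you rarely need to write even this much code.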

Setting Data Free With Gapminder

Last month Hans Rosling, the Swedish global health professor, statistician and sword swallower, released a desktop version of Gapminder World, his mesmerizing data visualization tool. Named one of Foreign Policy's top 100 global thinkers in 2009, the information design visionary co-founded the Gapminder Foundation with his son and daughter-in-law, aiming to make the world's most important trends accessible and digestible to global leaders, policy makers and the general public. The software they developed, Trendalyzer (acquired by Google in 2007), translates static numbers into dynamic, interactive bubbles moving through time. The desktop version of Gapminder, which is still in beta, allows you to create and present graphs without an Internet connection. Emily Cunningham is a research intern at ReadWriteWeb and a design and user experience intern at

Toxiclibs.js - Open-Source Library for Computational Design

There are several areas where toxiclibs.js stands apart to remain more idiomatic and helpful in the JavaScript environment. For a complete description of the conveniences added to toxiclibs.js, read the sugar file in the repository. Some examples of these differences are:

Data Management

From Wikipedia, the free encyclopedia

Challenges of data management

First, there is the need to be able to anticipate. Yet, because of the very nature of the design of many decision-support systems, which manipulate order-book and revenue data, the visibility offered by traditional monitoring systems is often limited.

Datasets on Datavisualization

Wikileaks US Embassy Cables (29 Nov 2010)

Wikileaks began on Sunday, November 28th, publishing 251,287 leaked United States embassy cables, the largest set of confidential documents ever to be released into the public domain.

Data cleansing

After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
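The strict-versus-fuzzy distinction can be sketched in a few lines of JavaScript: strict validation rejects a misspelled value outright, while fuzzy cleansing corrects it when it is within a small edit distance of a known entity. The function names, sample values, and threshold below are illustrative assumptions, not from the text:

```javascript
// Levenshtein edit distance between two strings (classic dynamic programming).
function editDistance(a, b) {
  var dp = [];
  for (var i = 0; i <= a.length; i++) { dp[i] = [i]; }
  for (var j = 0; j <= b.length; j++) { dp[0][j] = j; }
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1,      // deletion
                          dp[i][j - 1] + 1,      // insertion
                          dp[i - 1][j - 1] + cost); // substitution
    }
  }
  return dp[a.length][b.length];
}

// Fuzzy cleansing: snap a value to the closest known entity,
// but only if it is within `maxEdits` of one; otherwise return null
// (which is where strict validation would simply reject the record).
function fuzzyCorrect(value, knownValues, maxEdits) {
  var best = null, bestDist = maxEdits + 1;
  knownValues.forEach(function(known) {
    var d = editDistance(value.toLowerCase(), known.toLowerCase());
    if (d < bestDist) { best = known; bestDist = d; }
  });
  return best;
}
```

For example, fuzzyCorrect("Missisippi", ["Mississippi", "Missouri"], 2) corrects the typo, while a value nothing like any known entity comes back null and would be rejected.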

DataHub Tool - Wiki

Datahub is a tool for faster downloading/crawling, parsing, loading, and visualizing of data. It achieves this by letting you divide each step into its own work folder. In each work folder you get sample files that you can start coding from. Datahub is for people who have found an interesting data source and want to download it, parse it, load it into a database, document it, and visualize it.