Finding Data on the Internet

The following list of data sources has been modified as of 3/18/14. If an (R) appears after a source, this means that the data are already in R format or that there exist R commands for importing the data directly from within R.

Economics
American Economic Ass.

Data Science Practice
This section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly 2014).
Datasets on the book site:
Enron Email Dataset:
GetGlue (time stamped events: users rating TV shows):
Titanic Survival Data Set:
Half a million Hubway rides:

Finance
Government
Health Care
Gapminder:
Machine Learning
Networks
Science

Datasets for Data Mining, Analytics and Knowledge Discovery
See also: Data repositories.
AssetMacro: historical data of macroeconomic indicators and market data.
Related: Where can I find large datasets open to the public?

Public Data Sets
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use. Last Modified: Mar 17, 2014 17:51 PM GMT
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth. Last Modified: Nov 12, 2013 13:27 PM GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available. Last Modified: Oct 8, 2013 14:38 PM GMT
Human Microbiome Project Data Set. Last Modified: Sep 26, 2013 17:58 PM GMT
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available. Last Modified: Jul 18, 2012 16:34 PM GMT

Data Sets

The Pew Research Center's Internet Project is pleased to offer scholars access to raw data sets from our research. All uses of this data should reference the Pew Research Center as the source of the data and acknowledge that the Pew Research Center bears no responsibility for interpretations presented or conclusions reached based on analysis of the data. Our data sets are made available as single compressed archive files (.zip). Pew Research is interested in learning about other ways that scholars use our data.

January 2014 – 25th Anniversary of the Web (Omnibus): This survey contains questions about internet usage, cell and smartphone ownership, and Americans’ views about the role of the internet in their lives.
January 2014 – E-reading and Gadgets (Omnibus): This omnibus survey contains questions about reading, e-reading, and various electronic devices.
October 2013 – Pictorial Activities (Omnibus)
July 2013 – Anonymity (Omnibus): This omnibus survey contains questions about anonymity online.
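
Since the Pew data sets are distributed as single .zip archives, a first step in any analysis is getting a data file out of the archive. The following is a minimal sketch in Python using only the standard library. It assumes the archive contains a CSV file, which the source does not state (the actual archives may use other formats), and the file names, column names, and the helper `read_csv_from_zip` are all hypothetical. A tiny in-memory archive stands in for a downloaded file so the sketch runs without network access.

```python
import csv
import io
import zipfile


def read_csv_from_zip(archive, member_suffix=".csv"):
    """Return the rows (as dicts) of the first archive member ending in member_suffix.

    `archive` may be a path or a file-like object. The assumption that the
    archive holds a CSV file is ours, not something the data provider states.
    """
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if n.lower().endswith(member_suffix)]
        if not names:
            raise ValueError(f"no {member_suffix} file found in archive")
        with zf.open(names[0]) as raw:
            # Members are opened in binary mode; wrap to decode as text for csv.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_demo_archive():
    """Build a small in-memory archive standing in for a downloaded survey file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("readme.txt", "Hypothetical documentation file.")
        zf.writestr("survey.csv", "respondent,uses_internet\n1,yes\n2,no\n")
    buf.seek(0)
    return buf


rows = read_csv_from_zip(make_demo_archive())
print(len(rows), "rows; first respondent uses internet:", rows[0]["uses_internet"])
```

For a real download, the path to the saved .zip file could be passed in place of the demo archive, and the suffix adjusted to whatever format the archive actually contains.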

Publicly Available Big Data Sets :: Hadoop Illuminated

Public data sets on Amazon AWS
Amazon provides the following data sets: ENSEMBL annotated genome data, US Census data, UniGene, Freebase dump. Data transfer is 'free' within the Amazon ecosystem (within the same zone).
AWS data sets

InfoChimps
InfoChimps has a data marketplace with a wide variety of data sets.
InfoChimps market place

Comprehensive Knowledge Archive Network
Open source data portal platform with data sets available.

Stanford network data collection

Open Flights
Crowd-sourced flight data.

Flight arrival data

How to access 100M time series in R in under 60 seconds

DataMarket, a portal that provides access to more than 14,000 data sets from various public and private sector organizations, has more than 100 million time series available for download and analysis. (Check out this presentation for more info about DataMarket.) And now with the new package rdatamarket, it's trivially easy to import those time series into R for charting, analysis, or anything else. Here's what you need to do:

1. Register an account on DataMarket (it's free).
2. Install the rdatamarket package in R with install.packages("rdatamarket").
3. Browse for a time series of interest (I found this series on unemployment).
4. Copy the URL of the page you're on (the short URL works too).
5. Use the dmseries function with the URL to extract the time series as a zoo object.

With this package, you can go from finding interesting data on DataMarket to working with it in R in less than a minute.

Machine Learning Repository

IT Operations Analytics
In the fields of information technology and systems management, IT Operations Analytics (ITOA) is an approach or method applied to application software designed to retrieve, analyze and report data for IT operations. ITOA has been described as applying big data analytics to large datasets where IT operations can extract unique business insights.[1][2] In its Hype Cycle Report, Gartner rated the business impact of ITOA as being ‘high’, meaning that its use will see businesses enjoy significantly increased revenue or cost saving opportunities.[3] By 2017, Gartner predicts that 15% of enterprises will use IT operations analytics technologies to deliver intelligence for both business execution and IT operations.[2]

History
Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more Big Data practices, the ITOA industry has grown significantly since 2010.

The 70 Online Databases that Define Our Planet

Back in April, we looked at an ambitious European plan to simulate the entire planet. The idea is to exploit the huge amounts of data generated by financial markets, health records, social media and climate monitoring to model the planet’s climate, societies and economy. The vision is that a system like this can help to understand and predict crises before they occur so that governments can take appropriate measures in advance. There are numerous challenges here. Today, we get a grand tour of this challenge from Dirk Helbing and Stefano Balietti at the Swiss Federal Institute of Technology in Zurich. It turns out that there are already numerous sources of data that could provide the necessary fuel to power Helbing’s Earth Simulator. While good data from social sciences experiments has been hard to come by in the past, researchers are currently swamped by it thanks to a new generation of lab experiments, web experiments and the study of massive multi-player on-line games.

Where’s George?

French National Election Study, 1995

Principal Investigator(s): Lewis-Beck, Michael S.; Mayer, Nonna; Boy, Daniel, et al.

This national survey was conducted to study the attitudes and opinions of the French electorate during election year 1995. Information is provided on respondents' interest in politics, ideological leanings, voting behavior, party choice in the 1994 European elections, choice of presidential candidate in the first and second ballot of the 1995 French national elections, perceptions of the French presidential candidates' positions on the ideological spectrum and respondents...