Finding Data on the Internet

A Community Site for R, sponsored by Revolution Analytics. By RevoJoe on October 6, 2011.

The following list of data sources was last modified on 3/18/14. An (R) after a source means that the data are already in R format or that R commands exist for importing the data directly into R.

Economics: American Economic Association.

Data Science Practice: This section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly 2014). Listed sources: datasets on the book site; Enron Email Dataset; GetGlue (time-stamped events: users rating TV shows); Titanic Survival Data Set; half a million Hubway rides.

Other categories in the list: Finance, Government, Health Care (including Gapminder), Machine Learning, Networks, Science.

Datasets for Data Mining, Analytics and Knowledge Discovery
See also Data repositories. AssetMacro: historical data of macroeconomic indicators and market data.

Where can I find large datasets open to the public?

Journal of Statistics Education (JSE) Home Page
Current Issue: The November 2014 (Volume 22, Number 3) issue of JSE is now available. The table of contents can be accessed at: 2014 Table of Contents. This issue includes six regular articles, two Research on K-12 Statistics Education articles, two Teaching Bits, and an interview by Allan Rossman with Josh Tabor. As we normally do in our November issue, we have acknowledged all of the great referees who helped to review articles during the past year. We couldn't publish high-quality articles without the help of our many reviewers, and we are extremely thankful for their time and effort. We hope you enjoy this issue, and, as always, we welcome your feedback.

The JSE Webinar Series on CAUSEweb: The JSE webinar series continues to take place approximately once each month, on the third Tuesday of the month, from 12 – 1 p.m.

JSE on Facebook and Twitter: There is also a Twitter account for JSE that you can follow if you use Twitter (@JStatEd).

Public Data Sets

Common Crawl: A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use. Last modified: Mar 17, 2014 17:51 GMT.

NASA NEX: Three NASA NEX datasets are now available, including climate projections and satellite images of Earth. Last modified: Nov 12, 2013 13:27 GMT.

Ensembl: The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available. Last modified: Oct 8, 2013 14:38 GMT.

Human Microbiome Project Data Set. Last modified: Sep 26, 2013 17:58 GMT.

1000 Genomes Project: The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available. Last modified: Jul 18, 2012 16:34 GMT.

Data Sets

The Pew Research Center's Internet Project is pleased to offer scholars access to raw data sets from our research. All uses of this data should reference the Pew Research Center as the source of the data and acknowledge that the Pew Research Center bears no responsibility for interpretations presented or conclusions reached based on analysis of the data. Our data sets are made available as single compressed archive files (.zip files). Pew Research is interested in learning about other ways that scholars use our data.

January 2014 – 25th Anniversary of the Web (Omnibus): This survey contains questions about internet usage, cell and smartphone ownership, and Americans' views about the role of the internet in their lives.

January 2014 – E-reading and Gadgets (Omnibus): This omnibus survey contains questions about reading, e-reading, and various electronic devices.

October 2013 – Pictorial Activities (Omnibus)

July 2013 – Anonymity (Omnibus): This omnibus survey contains questions about anonymity online.

Publicly Available Big Data Sets :: Hadoop Illuminated

Public data sets on Amazon AWS: Amazon provides the following data sets: ENSEMBL annotated genome data, US Census data, UniGene, and a Freebase dump. Data transfer is 'free' within the Amazon ecosystem (within the same zone). See: AWS data sets.

InfoChimps: InfoChimps has a data marketplace with a wide variety of data sets. See: InfoChimps marketplace.

Comprehensive Knowledge Archive Network: an open source data portal platform.

Stanford network data collection: network data sets from Stanford.

Open Flights: crowd-sourced flight data.

Flight arrival data

Create an SPSS data set

Notes on the Missing Values Codes: What are missing values codes, and why do you need them? Sometimes in the collection of data there are values that are lost or cannot be gathered. These are called "missing values." When such values occur, the program must know that they are missing so that statistical calculations can take this into account. Missing values are usually designated by an impossible value. For example, the missing-values code for the variable AGE may be -9, since AGE can never actually have the value -9.

How to access 100M time series in R in under 60 seconds

DataMarket, a portal that provides access to more than 14,000 data sets from various public and private sector organizations, has more than 100 million time series available for download and analysis. (Check out this presentation for more info about DataMarket.) And now with the new package rdatamarket, it's trivially easy to import those time series into R for charting, analysis, or anything else. Here's what you need to do:

1. Register an account on DataMarket (it's free).
2. Install the rdatamarket package in R with install.packages("rdatamarket").
3. Browse for a time series of interest (I found this series on unemployment).
4. Copy the URL of the page you're on (the short URL works too).
5. Use the dmseries function with the URL to extract the time series as a zoo object.

With this package, you can go from finding interesting data on DataMarket to working with it in R in less than a minute.
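To make the missing-values codes from the SPSS notes above concrete, here is a minimal sketch of why the program needs to know which code means "missing". It is written in plain Python rather than SPSS syntax, and everything in it apart from the AGE code of -9 is an assumption for illustration: the sample ages and the helper names decode_missing and mean_ignoring_missing are hypothetical.

```python
from statistics import mean

MISSING_AGE = -9  # sentinel from the text: an impossible value for AGE


def decode_missing(values, missing_code):
    """Replace the sentinel code with None so it cannot be mistaken for data."""
    return [None if v == missing_code else v for v in values]


def mean_ignoring_missing(values):
    """Average only the observed values; return None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else None


# Hypothetical survey responses; two respondents did not report their age.
raw_ages = [34, -9, 27, 45, -9, 52]
ages = decode_missing(raw_ages, MISSING_AGE)

naive_mean = mean(raw_ages)                 # wrongly treats -9 as a real age
correct_mean = mean_ignoring_missing(ages)  # uses the four observed ages only

print("naive mean:", naive_mean)
print("mean with missing values excluded:", correct_mean)
```

The naive average is pulled down because each -9 is counted as if it were an age, which is the kind of error that declaring a missing-values code is meant to prevent.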

Machine Learning Repository

IT Operations Analytics

In the fields of information technology and systems management, IT Operations Analytics (ITOA) is an approach or method applied to application software designed to retrieve, analyze and report data for IT operations. ITOA has been described as applying big data analytics to large datasets where IT operations can extract unique business insights.[1][2] In its Hype Cycle Report, Gartner rated the business impact of ITOA as being 'high', meaning that its use will see businesses enjoy significantly increased revenue or cost saving opportunities.[3] By 2017, Gartner predicts that 15% of enterprises will use IT operations analytics technologies to deliver intelligence for both business execution and IT operations.[2]

History: Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more Big Data practices, the ITOA industry has grown significantly since 2010.