Data Repositories

July 2013 – Anonymity (omnibus) Crimes - 2001 to present. NDL/FNIC Food Composition Database Home Page. Global Health Observatory Data Repository. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. Datamob: Public data put to good use. NLTK Data. Research:Data. This page is an overview of the various sources of open-licensed data published by the Wikimedia Foundation or about Wikimedia projects.

Research:Data

The information is intended to help community members, developers and researchers learn about available data sources and find the data they need for their work. If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you wish to donate or document any additional data sources, you can use the Wikimedia group on DataHub. See also Wikistats, Statistics and proposals. Data Dumps. Home page: Data dumps. Description: WMF publishes data dumps of Wikipedia and all WMF projects on a regular basis. Content: stub-prefixed dumps for some projects, which only have header info for pages and revisions without actual content; media bundles for each project, separated into files uploaded to the project and files from Commons. Repost from /r/startups: MovieLens for porn. Use it for whatever interesting learning. : MachineLearning. KONECT - The Koblenz Network Collection. Data, Data, Data: Thousands of Public Data Sources.

We love data, big and small, and we are always on the lookout for interesting datasets.

Data, Data, Data: Thousands of Public Data Sources

Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data. In this post, we are sharing this list with you. Why? Well, searching for great datasets can be a time-consuming task. Categories of data sources: We grouped the links into categories, which bit.ly calls ‘Bundles’, to help you find what you are looking for, and bundled the Bundles into a single Data Sources Bundle.

Datamob: Public data put to good use. Tweets2011 Twitter Collection. Large Network Dataset Collection. Start [myPersonality Project] If you're here because of the news coverage: This wiki is aimed at researchers, although you're welcome to look around and see what we do.

start [myPersonality Project]

We also encourage you to try which predicts your personality based on your Facebook Likes. 2013-04-22: Added Smiley data in the download section. myPersonality was a popular Facebook application that allowed users to take real psychometric tests, and allowed us to record (with consent!) their psychological and Facebook profiles. Here we make available to registered collaborators a wide variety of data, including psychometric test scores, records of users' Facebook profiles, test item-level data, and some additional information, e.g. records of users' Likes. You will find more details about the available data in the Download databases section. To access the data you need to register as a collaborator. If you need help, or want to clarify or elaborate on something, please write to contact@mypersonality.org. Good luck with your research!

David Stillwell & Michal Kosinski. Machine Learning Repository: Bag of Words Data Set. Source: David Newman, newman '@' uci.edu, University of California, Irvine. Data Set Information: For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words).

Machine Learning Repository: Bag of Words Data Set

After tokenization and removal of stopwords, the vocabulary of unique words was truncated by only keeping words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons. These data sets have no class labels, and for copyright reasons no filenames or other document-level metadata. For each text collection we provide docword.*.txt (the bag of words file in sparse format) and vocab.*.txt (the vocab file).
Enron Emails: orig source: www.cs.cmu.edu/~enron; D=39861, W=28102, N=6,400,000 (approx)
NIPS full papers: orig source: books.nips.cc; D=1500, W=12419, N=1,900,000 (approx)
Webscope from Yahoo! Labs. Yahoo!
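For the Bag of Words files described above, a minimal Python sketch of reading a docword.*.txt file might look like the following. The excerpt only says the file is in "sparse format", so the layout used here is an assumption: three header lines holding D, W and NNZ, followed by one "docID wordID count" triple per line. Check the data set's own readme before relying on it. The function names and the tiny sample are made up for illustration.

```python
import io
from collections import defaultdict


def read_docword(lines):
    """Parse a docword file given as an iterable of text lines.

    Assumed layout (not stated in the excerpt above): three header
    lines with D, W and NNZ, then one 'docID wordID count' triple
    per line.
    """
    it = iter(lines)
    n_docs = int(next(it))   # D: number of documents
    n_vocab = int(next(it))  # W: number of words in the vocabulary
    nnz = int(next(it))      # NNZ: number of nonzero counts
    counts = defaultdict(dict)
    for line in it:
        if not line.strip():
            continue  # tolerate blank lines
        doc_id, word_id, count = map(int, line.split())
        counts[doc_id][word_id] = count
    return n_docs, n_vocab, nnz, dict(counts)


def total_words(counts):
    """N: total number of word occurrences across all documents."""
    return sum(c for doc in counts.values() for c in doc.values())


# Made-up sample with D=2, W=3, NNZ=4 (not real data).
SAMPLE = """2
3
4
1 1 2
1 3 1
2 2 5
2 3 1
"""

n_docs, n_vocab, nnz, counts = read_docword(io.StringIO(SAMPLE))
print(n_docs, n_vocab, nnz, total_words(counts))
```

For a real file, passing an open file handle (for example `open("docword.nips.txt")`, a hypothetical local path) in place of the `io.StringIO` object should work the same way, since the parser only iterates over lines.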

Webscope from Yahoo! Labs

Network flow data contains communication patterns between end-users on the wider Internet and Yahoo servers. A netflow record includes timestamp, source IP address, destination IP address, source port, destination port, protocol, number of packets, and number of bytes transferred from the source to the destination. The record does not include the content of the data communication.
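The fields listed above can be sketched as a small Python record type. The excerpt does not describe the on-disk format of the Webscope files, so the comma-separated layout, the Unix-epoch timestamp, the field names and the sample lines below are all hypothetical and only illustrate how the eight fields fit together.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NetflowRecord:
    """One netflow record with the eight fields named in the text."""
    timestamp: datetime
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    packets: int
    bytes_sent: int


def parse_record(line: str) -> NetflowRecord:
    # Hypothetical layout: eight comma-separated fields, with the
    # timestamp given as Unix epoch seconds. The real files may differ.
    ts, src_ip, dst_ip, sport, dport, proto, pkts, nbytes = line.strip().split(",")
    return NetflowRecord(
        timestamp=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        src_ip=src_ip,
        dst_ip=dst_ip,
        src_port=int(sport),
        dst_port=int(dport),
        protocol=int(proto),
        packets=int(pkts),
        bytes_sent=int(nbytes),
    )


def bytes_per_source(records):
    """Sum the bytes transferred, grouped by source IP address."""
    totals = {}
    for r in records:
        totals[r.src_ip] = totals.get(r.src_ip, 0) + r.bytes_sent
    return totals


# Made-up sample lines (not taken from the dataset).
SAMPLE_LINES = [
    "1192060800,10.0.0.1,10.0.0.9,51000,80,6,3,1800",
    "1192060860,10.0.0.2,10.0.0.9,51001,53,17,1,120",
    "1192060920,10.0.0.1,10.0.0.9,51002,80,6,2,900",
]

records = [parse_record(line) for line in SAMPLE_LINES]
totals = bytes_per_source(records)
print(totals)
```

Because the record carries no payload, aggregates like the per-source byte totals here are the kind of summary that can be computed from it.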

Each netflow data file consists of sampled netflow records exported from routers in 15-minute intervals. The dataset includes netflow data files collected from three border routers on October 11, 2007. Here are all the papers published on this Webscope Dataset: Constructing and Testing Privacy-Aware Services in a Cloud Computing Environment – Challenges and Opportunities (Invited Paper); Who are You Talking to? All datasets have been reviewed to conform to Yahoo! ' Department Head approval is not required. Reuters-21578 Text Categorization Test Collection. Webscope from Yahoo! Labs. The ClueWeb09 Dataset. The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies.

The ClueWeb09 Dataset

It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. Dataset Specifications: Web Pages: 1,040,809,705 web pages in 10 languages; 5 TB compressed (25 TB uncompressed). BBC Datasets - Machine Learning Group (UCD). Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research.

BBC Datasets - Machine Learning Group (UCD)

Text REtrieval Conference (TREC) Data. Datasets for Data Mining. Datasets for Data Mining, Analytics and Knowledge Discovery. See also.

Datasets for Data Mining, Analytics and Knowledge Discovery

Machine Learning Repository.