
Data Repositories


July 2013 – Anonymity (omnibus) | Pew Research Center. Crimes - 2001 to present | City of Chicago | Data Portal. NDL/FNIC Food Composition Database Home Page. Global Health Observatory Data Repository. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. Datasets | Datamob: Public data put to good use. NLTK Data. Research:Data. This page is an overview of the various sources of open-licensed data published by the Wikimedia Foundation or about Wikimedia projects.

The information is intended to help community members, developers and researchers learn about available data sources and find the data they need for their work. If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you wish to donate or document any additional data sources, you can use the Wikimedia group on DataHub. See also Wikistats, Statistics and proposals.

Data Dumps

Home page: Data dumps

Description: WMF publishes data dumps of Wikipedia and all WMF projects on a regular basis.

Content: Stub-prefixed dumps for some projects, which have only header info for pages and revisions without actual content. Media bundles for each project, separated into files uploaded to the project and files from Commons. Static HTML dumps for 2007-2008.

Repost from /r/startups: MovieLens for porn. Use it for whatever interesting learning. : MachineLearning. KONECT - The Koblenz Network Collection. Data, Data, Data: Thousands of Public Data Sources. We love data, big and small, and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use.

It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data. In this post, we are sharing this list with you. Why? Well, searching for great datasets can be a time-consuming task.

Categories of data sources: We grouped the links into categories called ‘Bundles’ to help you find what you are looking for, and bundled the Bundles into a single Data Sources Bundle.

Machine Learning Datasets: Although many datasets can be used for machine learning tasks, the sources in this Bundle are specifically pre-processed for machine learning.

Machine Learning Challenges: Our next bundle of links contains links to Machine Learning Challenges.

Marketplaces and data hubs. Open companies. Data search engines. Data Journals.

Twitter datasets | Datamob: Public data put to good use. Tweets2011 Twitter Collection. Large Network Dataset Collection.

Social networks
Networks with ground-truth communities
Communication networks
Citation networks
Collaboration networks
Web graphs
Product co-purchasing networks
Internet peer-to-peer networks
Road networks
Autonomous systems graphs
Signed networks
Location-based online social networks
Wikipedia networks, articles, and metadata
Temporal networks
User Actions
Memetracker and Twitter
Online Communities
Online Reviews
Face-to-Face Communication Networks
Graph classification datasets

Network types
Directed: directed network
Undirected: undirected network
Bipartite: bipartite network
Multigraph: network has multiple edges between a pair of nodes
Temporal: for each node/edge we know the time when it appeared in the network
Labeled: network contains labels (weights, attributes) on nodes and/or edges

Network statistics

Citing SNAP: We encourage you to cite our datasets if you have used them in your work.
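The network types listed above can be made concrete with a small check. The following is a minimal sketch, not part of SNAP: it assumes a network is given as a plain list of (source, target) pairs, and the function name is_multigraph is hypothetical. It shows how the directed/undirected distinction changes whether two edges count as joining the same pair of nodes.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_multigraph(edges: Iterable[Edge], directed: bool) -> bool:
    """Return True if some pair of nodes is joined by more than one edge.

    In a directed network (u, v) and (v, u) are different edges;
    in an undirected network they are the same edge.
    """
    def key(u: Hashable, v: Hashable):
        return (u, v) if directed else frozenset((u, v))

    counts = Counter(key(u, v) for u, v in edges)
    return any(n > 1 for n in counts.values())


# Hypothetical toy edge list, not taken from any real dataset.
toy_edges = [(1, 2), (2, 1), (2, 3)]
as_undirected = is_multigraph(toy_edges, directed=False)
as_directed = is_multigraph(toy_edges, directed=True)
```

Read as undirected, the toy list repeats the pair {1, 2}; read as directed, every edge is distinct.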

Start [myPersonality Project]. If you're here because of the news coverage: this wiki is aimed at researchers, although you're welcome to look around and see what we do. We also encourage you to try which predicts your personality based on your Facebook Likes.

2013-04-22: Added Smiley data in the download section.

myPersonality was a popular Facebook application that allowed users to take real psychometric tests, and allowed us to record (with consent!) their psychological and Facebook profiles. We have made a wide variety of data available to registered collaborators, including psychometric test scores, records of users' Facebook profiles, test item level data, and some additional information, e.g. records of users' Likes. You will find more details about the available data in the Download databases section.

To access the data, you need to register as a collaborator. If you need help, or want to clarify or elaborate on something, please write to Good luck with your research!

Machine Learning Repository: Bag of Words Data Set.

Source: David Newman, newman '@' uci.edu, University of California, Irvine.

Data Set Information: For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words).
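To make D, W, N and NNZ concrete, here is a minimal sketch of reading a sparse bag-of-words file. It assumes a layout of three header lines (D, W, NNZ) followed by one "docID wordID count" triple per line; the function name read_docword and the sample data are hypothetical, so check the layout against the files you actually download.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def read_docword(lines: Iterable[str]) -> Tuple[int, int, int, Dict[int, Dict[int, int]]]:
    """Parse a sparse bag-of-words listing.

    Assumed layout: three header lines (D, W, NNZ), then
    whitespace-separated "docID wordID count" triples.
    """
    it = iter(lines)
    d = int(next(it))      # number of documents
    w = int(next(it))      # vocabulary size
    nnz = int(next(it))    # number of nonzero counts
    counts: Dict[int, Dict[int, int]] = defaultdict(dict)
    for line in it:
        parts = line.split()
        if not parts:
            continue       # tolerate blank lines
        doc_id, word_id, count = (int(p) for p in parts)
        counts[doc_id][word_id] = count
    return d, w, nnz, dict(counts)


# Hypothetical toy collection: 2 documents, 3 vocabulary words, 3 nonzero counts.
sample = ["2", "3", "3", "1 1 2", "1 3 1", "2 2 4"]
d, w, nnz, counts = read_docword(sample)

# N is the total number of words, i.e. the sum of all counts.
n_total = sum(c for doc in counts.values() for c in doc.values())
# NNZ can be cross-checked against the number of stored triples.
n_triples = sum(len(doc) for doc in counts.values())
```

For a real file, the same function can be called on an open file handle, since it only needs an iterable of lines.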

After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons. These data sets have no class labels and, for copyright reasons, no filenames or other document-level metadata. For each text collection we provide docword.*.txt (the bag of words file in sparse format) and vocab.*.txt (the vocab file).

Enron Emails: D=39861, W=28102, N=6,400,000 (approx)
NIPS full papers: D=1500, W=12419, N=1,900,000 (approx)

Webscope from Yahoo! Labs. Yahoo! Network flows data contains communication patterns between end-users in the large Internet and Yahoo servers. A netflow record includes timestamp, source IP address, destination IP address, source port, destination port, protocol, number of packets, and number of bytes transferred from the source to the destination.
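The netflow fields listed above map naturally onto a small record type. The sketch below is illustrative only: the comma-separated line format, the field names, and the sample values are all assumptions, since the actual file layout of the Webscope dataset is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class NetflowRecord:
    """One flow record with the eight fields named in the description."""
    timestamp: int
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    packets: int
    num_bytes: int


def parse_record(line: str) -> NetflowRecord:
    """Parse one record from a hypothetical comma-separated line."""
    ts, src, dst, sport, dport, proto, pkts, nbytes = line.strip().split(",")
    return NetflowRecord(int(ts), src, dst, int(sport), int(dport),
                         int(proto), int(pkts), int(nbytes))


def bytes_per_source(records: Iterable[NetflowRecord]) -> Dict[str, int]:
    """Sum the bytes transferred, grouped by source IP address."""
    totals: Dict[str, int] = {}
    for r in records:
        totals[r.src_ip] = totals.get(r.src_ip, 0) + r.num_bytes
    return totals


# Made-up sample lines; the addresses stand in for anonymized IPs.
sample_lines = [
    "1000,10.0.0.1,10.0.0.9,51000,80,6,3,1200",
    "1001,10.0.0.1,10.0.0.7,51001,443,6,5,800",
    "1002,10.0.0.2,10.0.0.9,51002,53,17,1,90",
]
records = [parse_record(line) for line in sample_lines]
totals = bytes_per_source(records)
```

Because the addresses are anonymized by a permutation, grouping by source IP as above still works: each real address maps to one consistent anonymized value.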

The record does not include the content of the data communication. Each netflow data file consists of sampled netflow records exported from routers in 15-minute intervals. The dataset includes netflow data files collected from three border routers on October 11, 2007. All IP addresses in the dataset are anonymized using a random permutation algorithm.

Here are all the papers published on this Webscope Dataset:
Constructing and Testing Privacy-Aware Services in a Cloud Computing Environment – Challenges and Opportunities (Invited Paper)
Who are You Talking to?

All datasets have been reviewed to conform to Yahoo!

Department Head approval is not required.

Reuters-21578 Text Categorization Test Collection. Webscope from Yahoo! Labs.

The ClueWeb09 Dataset. The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

Dataset Specifications
Web Pages: 1,040,809,705 web pages in 10 languages; 5 TB compressed (25 TB uncompressed). See the Record Counts Section on the Dataset Information and Sample Files page for detailed information on the distribution of records and languages.
Web Graph: Information on how the crawl progressed is also available.

Dataset Distribution: The ClueWeb09 dataset and subsets are distributed in several different ways.
Full, 4 x 1.5TB: The full dataset is distributed as tarred/gzipped files on four 1.5 terabyte (TB) hard disk drives, in Linux ext3 format. Web pages are in the WARC file format.
Online Services Using A Hosted Copy of the ClueWeb09 Dataset. Sign an Organizational Agreement.

BBC Datasets - Machine Learning Group (UCD). Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format.

If you make use of these datasets please reference the publication: All rights, including copyright, in the content of the original articles are owned by the BBC.

Dataset: BBC
Consists of documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Documents: 2225, Terms: 9636
Natural Classes: 5 (business, entertainment, politics, sport, tech)

Dataset: BBCSport
Consists of documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005.
Documents: 737, Terms: 4613
Natural Classes: 5 (athletics, cricket, football, rugby, tennis)

File formats.

Text REtrieval Conference (TREC) Data. Datasets for Data Mining. Datasets for Data Mining, Analytics and Knowledge Discovery. See also Data repositories. AssetMacro, historical data of Macroeconomic Indicators and Market Data. Awesome Public Datasets on GitHub, curated by caesar0301.

AWS (Amazon Web Services) Public Data Sets, a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. Related. Machine Learning Repository.