Datasets for Data Mining and Data Science

See also: Data repositories
AssetMacro: historical data of macroeconomic indicators and market data.
Awesome Public Datasets: on GitHub, curated by caesar0301.
AWS (Amazon Web Services) Public Data Sets: a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
BigML: big list of public data sources.

Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
Recognizing Objects with Deep Learning
You might have seen this famous xkcd comic before. The goof is based on the idea that any 3-year-old child can recognize a photo of a bird, but figuring out how to make a computer recognize objects has puzzled the very best computer scientists for over 50 years. In the last few years, we've finally found a good approach to object recognition using deep convolutional neural networks.

BBC Datasets - Machine Learning Group (UCD)
Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format. If you make use of these datasets please reference the publication:
Dataset: BBC
All rights, including copyright, in the content of the original articles are owned by the BBC.
Consists of documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Documents: 2225, Terms: 9636, Natural Classes: 5 (business, entertainment, politics, sport, tech)

AWS Public Data Sets
High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments. Last Modified: February 9, 2016
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth. Last Modified: February 9, 2016

Public Data Sets on AWS
Here are some examples of popular Public Data Sets:
NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
1000 Genomes Project: A detailed map of human genetic variation
Google Books Ngrams: A data set containing Google Books n-gram corpora
US Census Data: US demographic data from the 1980, 1990, and 2000 US Censuses
Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics
The data sets are hosted in two possible formats: Amazon Elastic Block Store (Amazon EBS) snapshots and/or Amazon Simple Storage Service (Amazon S3) buckets. If you have any questions or want to participate in our Public Data Sets community, please visit our Public Data Sets forum.

The ClueWeb09 Dataset
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Dataset Specifications
Web Pages:

Java API for WordNet Searching (JAWS)
From within the application you started, you can use JAWS by first obtaining an instance of WordNetDatabase with code like the following, which assumes that you've performed an import of the classes in the edu.smu.tspell.wordnet package:

    WordNetDatabase database = WordNetDatabase.getFileInstance();

Once you've done so, you can begin to retrieve synsets from the database as shown in the example below. This code retrieves all noun synsets for "fly" and loops through each one, printing its first word form, its description, and the number of hyponyms associated with that noun synset:

    NounSynset nounSynset;
    NounSynset[] hyponyms;
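The JAWS excerpt above stops before the loop itself, so here is a minimal, self-contained sketch of the retrieval pattern it describes. It deliberately does not depend on JAWS or on a WordNet installation: NounEntry, ToyDatabase, and the sample glosses are hypothetical stand-ins invented for illustration, not JAWS classes or WordNet content. In JAWS the corresponding calls are, as far as I recall from its documentation, getSynsets("fly", SynsetType.NOUN) on the database and getWordForms(), getDefinition(), and getHyponyms() on each NounSynset; treat those names as an assumption and check them against the JAWS Javadoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the JAWS classes.
class SynsetSketch {

    /** Minimal noun synset: its word forms, a gloss, and its direct hyponyms. */
    static final class NounEntry {
        final String[] wordForms;
        final String definition;
        final NounEntry[] hyponyms;

        NounEntry(String definition, NounEntry[] hyponyms, String... wordForms) {
            this.definition = definition;
            this.hyponyms = hyponyms;
            this.wordForms = wordForms;
        }
    }

    /** Tiny in-memory lookup standing in for the WordNet database. */
    static final class ToyDatabase {
        private final Map<String, NounEntry[]> nouns;

        ToyDatabase(Map<String, NounEntry[]> nouns) {
            this.nouns = nouns;
        }

        /** Returns every noun entry for the word form, or an empty array. */
        NounEntry[] getNounSynsets(String wordForm) {
            return nouns.getOrDefault(wordForm, new NounEntry[0]);
        }
    }

    /** One line per noun synset: first word form, gloss, hyponym count. */
    static List<String> describe(ToyDatabase database, String wordForm) {
        List<String> lines = new ArrayList<>();
        for (NounEntry entry : database.getNounSynsets(wordForm)) {
            lines.add(entry.wordForms[0] + ": " + entry.definition
                    + " (" + entry.hyponyms.length + " hyponyms)");
        }
        return lines;
    }

    /** Made-up sample data; the glosses are placeholders, not WordNet text. */
    static ToyDatabase sampleDatabase() {
        NounEntry housefly = new NounEntry("made-up gloss for a housefly",
                new NounEntry[0], "housefly");
        NounEntry fruitFly = new NounEntry("made-up gloss for a fruit fly",
                new NounEntry[0], "fruit fly");
        NounEntry insect = new NounEntry("made-up gloss for an insect",
                new NounEntry[] {housefly, fruitFly}, "fly");
        NounEntry tentFlap = new NounEntry("made-up gloss for a tent flap",
                new NounEntry[0], "fly", "tent flap");
        return new ToyDatabase(Map.of("fly", new NounEntry[] {insect, tentFlap}));
    }

    public static void main(String[] args) {
        for (String line : describe(sampleDatabase(), "fly")) {
            System.out.println(line);
        }
    }
}
```

To adapt the sketch to the real library, the stand-in types would be replaced by the classes from the edu.smu.tspell.wordnet package, while the loop body would stay essentially the same.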

