
Datasets


Google Dataset Search. City Observatory Data Portal. U.K. Indicators For The Sustainable Development Goals. GSS Trade Alpha. USMART. OpenPrescribing. Awesome Public Datasets. Data.gov. Data Sources · MRAN. NASA Data. SpaceX-API: Open Source REST API for rocket, core, capsule, pad, and launch data. Youth Homelessness DataBank. European Data Portal. Fun Data for teaching R. I’ll be running an R course soon and I am looking for fun (public) datasets to use in data manipulation and visualization.

Fun Data for teaching R

I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer. [*in this case students are not ecologists] Ideas: -Movies. -Music. Road Safety Data. Open Data Barometer. Resources to Find the Data You Need, 2016 Edition. 25+ websites to find datasets for data science projects. Some datasets for teaching data science. In this post I describe the dslabs package, which contains some datasets that I use in my data science courses.

Some datasets for teaching data science

A much discussed topic in stats education is that computing should play a more prominent role in the curriculum. I strongly agree, but I think the main improvement will come from bringing applications to the forefront and mimicking, as best as possible, the challenges applied statisticians face in real life. I therefore try to avoid using widely used toy examples, such as the mtcars dataset, when I teach data science.

Rfordatascience/tidytuesday. Sports Data and R. Networkdata: R package containing several network datasets. 50+ free Datasets for Data Science Projects. [Updated as of Jan 31, 2020] 50+ free datasets for your data science project portfolio. There is no doubt that having a project portfolio is one of the best ways to master data science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja!

50+ free Datasets for Data Science Projects

In fact, students and job seekers who showcase their skills with a unique portfolio find it easier to land lucrative jobs faster than their peers! Our World in Data. NHS Scotland Open Data. NHS-R Community datasets package. This post briefly introduces an R package created for the NHS-R Community to help us learn and teach R.

NHS-R Community datasets package

Firstly, it is now available on CRAN, the major package repository for R, and can be installed like any other package, or directly from GitHub as follows: install.packages("NHSRdatasets") # or remotes::install_github("…") Why? Several community members have mentioned the difficulties of learning and teaching R using standard teaching datasets. Stock datasets like iris, mtcars, nycflights13 etc. are all useful, but they are out of context for most NHS, Public Health and related staff.

For those of us wanting to contribute to Open Source software, or practise using Git and GitHub, it also provides an opportunity to learn/practise these skills by contributing data. What’s in it? CKAN 2.8.2 documentation: DataStore extension. The CKAN DataStore extension provides an ad hoc database for storage of structured data from CKAN resources.

CKAN 2.8.2 documentation: DataStore extension

Data can be pulled out of resource files and stored in the DataStore. When a resource is added to the DataStore, you get: (1) automatic data previews on the resource’s page, using the Data Explorer extension; (2) the DataStore API: search, filter and update the data, without having to download and upload the entire data file. The DataStore is integrated into the CKAN API and authorization system. The DataStore is generally used alongside the DataPusher, which will automatically upload data to the DataStore from suitable files, whether uploaded to CKAN’s FileStore or externally linked.
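As a rough illustration of the "search, filter" part of the DataStore API described above, the sketch below builds a request for CKAN's `datastore_search` action and reads the records from the JSON response. The portal address `https://ckan.example.org` and the resource id are placeholders, not real endpoints, and the helper names `build_datastore_search_url` and `fetch_records` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_datastore_search_url(base_url, resource_id, filters=None, q=None, limit=5):
    """Build a URL for CKAN's datastore_search action.

    base_url and resource_id are supplied by the caller; the values used in
    the example below are placeholders, not a real portal or resource.
    """
    params = {"resource_id": resource_id, "limit": limit}
    if filters:
        # CKAN expects filters as a JSON object of column -> value pairs.
        params["filters"] = json.dumps(filters)
    if q:
        # q performs a full-text search across the resource.
        params["q"] = q
    return f"{base_url.rstrip('/')}/api/3/action/datastore_search?{urlencode(params)}"


def fetch_records(url):
    """Call the API and return the list of matching rows (needs network access)."""
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["result"]["records"]


# Placeholder portal and resource id, for illustration only.
example_url = build_datastore_search_url(
    "https://ckan.example.org/",
    "hypothetical-resource-id",
    filters={"year": 2019},
    limit=10,
)
print(example_url)
```

Passing the resulting URL to `fetch_records` against a real CKAN portal would return only the matching rows, which is the point made above: there is no need to download the whole data file to inspect part of it.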

Relationship to FileStore¶ The DataStore is distinct but complementary to the FileStore (see FileStore and file uploads). Ckanr: R client for the CKAN API. ScotlandsData. Open Data Glasgow. OK Scotland. Scotland's Environment Web. Scotland’s Environment Web is developing a range of data visualisation applications that can be used by a wide variety of people with varying data interests and expertise - including the general public, teachers and students, policy officers and environmental assessment specialists.

Scotland's Environment Web

These applications read from significant amounts of raw data, in different formats and from multiple sources. They present data in an interactive format that helps users to relate, transform, manipulate and analyse it to extract useful information. Live links are maintained to the published data so that automated updates are made on a regular basis. This ensures that the most up-to-date data is always presented. One of the major benefits of these tools is that the interactive graphs, images and data tables shown in the application can be exported as images, CSV files and PDF documents for use in reports and presentations. Improvement Service - Spatial Hub. Air Quality Management Areas. Description: Local Authorities have a duty to designate any relevant areas where the air quality objectives are not (or are unlikely to be) being met as Air Quality Management Areas (AQMAs).

Improvement Service - Spatial Hub

AQMAs must be designated officially by means of an 'order'. The extent of the AQMA may be limited to the area of exceedance or encompass a larger area. Most data provided by local authorities is in polygon format. We have included date of AQMA declaration in our national schema, though many LAs do not currently provide this. Revoked AQMAs are now held in this dataset with a 'Date revoked' attribute. COVID-19-Management-Information. AI Playbook - Datasets. Kaggle: Kaggle includes nearly 600 'Featured' datasets that are well documented and prepped for ML analysis.

AI Playbook - Datasets

Any user or organization can publish data on Kaggle Datasets, and it includes classics like Iris as well as unique datasets published by our users. Reuters Corpora (RCV1, RCV2, TRC2), NIST/Reuters: In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.

English Broadcast News Speech (HUB4) Johns Hopkins, Upenn The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. Multi-Domain Sentiment Dataset (version 2.0) The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains).

Fueling the Gold Rush: The Greatest Public Datasets for AI. It has never been easier to build AI or machine learning-based systems than it is today.

Fueling the Gold Rush: The Greatest Public Datasets for AI

The ubiquity of cutting edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee. Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.

However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility. It’s important to remember that good performance on a data set doesn’t guarantee a machine learning system will perform well in real product scenarios. Introducing Kaggle Datasets. At Kaggle, we want to help the world learn from data.

Introducing Kaggle Datasets

This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. Baidu has Released a Gigantic Self-Driving Dataset named ApolloScape.