Awesome Public Datasets

https://github.com/awesomedata/awesome-public-datasets

Related: Datasets

Introducing Kaggle Datasets At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. 70+ websites to get large data repositories for free Do you require GBs of data to check the performance of your app? The easiest way is to download samples of data from free data repositories available on the Web. But the main disadvantage of this approach is the data will have very less unique content and it may not give desired results. Below are 70+ websites to get large data repositories for free. Wikipedia:Database offers free copies of all available content to interested users. data is available in multiple languages.

5 Ways To Gain Real-World Data Science Experience Gaining data science experience without having a data science job seems daunting. One of the biggest questions I get from people trying to break in to data science is “how do I gain data science experience if I don’t have a data science job?”. To answer this question, I’ve put together the following top 5 ways to gain useful, real-world, data science experience: Build Small ProjectsVolunteer as a Data ScientistJoin a MeetupCreate TutorialsContribute to Open Source Projects I’ll go through each one of these in detail, and give you specific actions to take. You certainly don’t need to do everything on this list before you can start applying for data science jobs.

Fun Data for teaching R I’ll be running an R course soon and I am looking for fun (public) datasets to use in data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And I want that when I put exercises, the students* are curious about finding out the answer. 17 places to find datasets for data science projects If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time browsing the internet looking for interesting data sets to analyze. It can be fun to sift through dozens of data sets to find the perfect one. But it can also be frustrating to download and import several csv files, only to realize that the data isn’t that interesting after all. Luckily, there are online repositories that curate data sets and (mostly) remove the uninteresting ones. In this post, we’ll walk through several types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find data sets for each. Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or you have a spare few hours and want to practice your machine learning skills, we’ve got you covered.

Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data by Avishek Nag (Machine Learning expert) A comparison of different classifiers’ accuracy & performance for high-dimensional data In Machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due this ‘curse of dimensionality’ problem. In this article, we will see how accuracy and performance vary across different classifiers. Scotland's Environment Web Scotland’s Environment Web is developing a range of data visualisation applications that can be used by a wide variety of people with varying data interests and expertise - including the general public, teachers and students, policy officers and environmental assessment specialists. These applications read from significant amounts of raw data, in different formats and from multiple sources. They present data in an interactive format that helps users to relate, transform, manipulate and analyse to extract useful information. Live links are maintained to the published data so that automated updates are made on a regular basis.

Six Characteristics of a Great STEM Lesson Published Online: June 17, 2014 By Anne Jolly STEM is more than just a grouping of subject areas. Guide to Sample Data Sets – IBM Analytics Communities You can start getting familiar with Watson Analytics by using the sample data sets provided in this community. These data sets have all been tested with Watson Analytics, and are the basis for many of the Watson Analytics demonstrations and videos. A description of each is below. To use these data sets: Download a file from the links below.Log in to Watson Analytics.On the Watson Analytics Home page, tap New data to browse and select the data file.

How to build an image classifier with greater than 97% accuracy by Anne Bonner A clear and complete blueprint for success How do you teach a computer to look at an image and correctly identify it as a flower? How do you teach a computer to see an image of a flower and then tell you exactly what species of flower it is when even you don’t know what species it is? Let me show you! This article will take you through the basics of creating an image classifier with PyTorch.

Fueling the Gold Rush: The Greatest Public Datasets for AI It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee. Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI. However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility. It’s important to remember that good performance on data set doesn’t guarantee a machine learning system will perform well in real product scenarios.

Boxplots Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data. An example of a formula is y~group where a separate boxplot for numeric variable y is generated for each value of group. Add varwidth=TRUE to make boxplot widths proportional to the square root of the samples sizes. 88 IoT and Public Big Data Datasets: Finance, Insurance + More It is a commonly held belief that to gain valuable insight from Big Data, companies have to invest in collecting and analysing their own. However, those without the resources to do so, can actually support their IoT projects using public big data datasets. There are hundreds, if not thousands of these datasets open to the public. The data is free and available to anyone with an internet connection, allowing you to spot trends and patterns in global or local sectors. While the content of these IoT datasets varies considerably: ranging from government sources to private firms – there is accessible insight for a huge number of industries, including legal, financial services and healthcare.

Predicting Airline Delays – Jesse Steinweg-Woods, Ph.D. – Data Scientist Source I don’t know about all of you, but flying doesn’t always go smoothly for me. I have had some horror stories I could tell you about weird delays I have encountered while flying.