
GitHub - caesar0301/awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going).


Open Payments Data

5 Ways To Gain Real-World Data Science Experience Gaining data science experience without having a data science job seems daunting. One of the biggest questions I get from people trying to break into data science is “how do I gain data science experience if I don’t have a data science job?”. To answer this question, I’ve put together the following top 5 ways to gain useful, real-world data science experience: Build Small Projects; Volunteer as a Data Scientist; Join a Meetup; Create Tutorials; Contribute to Open Source Projects. I’ll go through each one of these in detail and give you specific actions to take. You certainly don’t need to do everything on this list before you can start applying for data science jobs. My advice would be to focus on one or two things in this list; don’t try to do everything at once. Build Small Projects: My first, and probably favorite, method for gaining real-world data science experience is to build small projects. I say “small” because it’s not a complete end-to-end project. 1. Here are some sources of messy datasets:

Introducing Kaggle Datasets At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. Kaggle Datasets has four core components: Access: simple, consistent access to the data with clear licensing; Analysis: a way to explore the data without downloading it; Results: visibility into the previous work that’s been created on the data; Conversation: forums and comments for discussing the nuances of the data. Are you interested in publishing one of your datasets on Access Simple, consistent access to the data You have a cleanly designed page with several basic elements: a single download link to get the entire dataset, and a clear description of the data. Analysis A way to explore the data without downloading it Screenshot of a Kaggle Script being edited on the 2013 American Community Survey Results Conversation Seeding Kaggle Datasets

The Cure for Cancer Is Data—Mountains of Data A few years ago Eric Schadt met a woman who had cancer. It was an aggressive form of colon cancer that had come on quickly and metastasized to her liver. She was a young war widow from Mississippi, the mother of two girls she was raising alone, and she had only the health care that her husband’s death benefits afforded her—an overburdened oncologist at a military hospital, the lowest rung on the health care ladder. The polar opposite of cutting-edge medicine. To walk into such a facility with stage 4 metastatic disease is to walk back in time to the world of the unmapped human genome, when “colon cancer” was understood to have a single cause instead of millions of causes resulting in unique variations, when treatment was the same bag of poison, whether you were in Ocean Springs, Mississippi, or Timbuktu. A time without big data, machine learning, or hope. Schadt isn’t a cancer specialist or even a medical doctor. Seated at his desk at Mount Sinai, Schadt is direct and disarming. Apple

awesomedata/awesome-public-datasets: A topic-centric list of HQ open datasets. PR ☛☛☛

Fun Data for teaching R I’ll be running an R course soon and I am looking for fun (public) datasets to use in data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer. [*in this case students are not ecologists] Ideas: -Movies. -Music. -Football: I discarded this one for me because I know nothing about it, but I am sure it will be highly popular in Spain. -Kaggle datasets are also awesome. -Earthquakes: This one also needs some parsing of the txt files (easier than IMDB) and will do for pretty visualizations. -Datasets already in R: Along with the classic datasets on Iris flowers (used by Fisher!) -Other: The Internet is full of data like real time series, lots of small data examples, M&M’s colors by bag, Jeopardy questions, Marvel social networks, Dolphins social networks, …

OpenOil

Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data by Avishek Nag (Machine Learning expert) A comparison of different classifiers’ accuracy & performance for high-dimensional data. In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this ‘curse of dimensionality’ problem. In this article, we will see how accuracy and performance vary across different classifiers. Understanding the ‘datasource’ & problem formulation: For this article, we will use the “EEG Brainwave Dataset” from Kaggle. So, to start with, let’s first read the data to see what’s there. There are 2549 columns in the dataset and ‘label’ is the target column for our classification problem. As per Kaggle, here is the challenge: “Can we predict emotional sentiment from brainwave readings?” Let’s first understand class distributions from column ‘label’: So, there are three classes, ‘POSITIVE’, ‘NEGATIVE’ & ‘NEUTRAL’, for emotional sentiment. RandomForest Classifier Conclusion
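The snippet above mentions reading the data and checking the class distribution of the ‘label’ column, but the code itself did not survive in this excerpt. A minimal sketch of that step might look like the following. It uses only the Python standard library rather than whatever the article used, and the sample rows and the names `SAMPLE_CSV`, `feature_a`, `feature_b` and `class_distribution` are made up for illustration; only the ‘label’ column name and the three class names come from the text.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the Kaggle CSV: a few made-up rows with two
# feature columns plus the 'label' target column named in the article.
# The real dataset is described as having 2549 columns.
SAMPLE_CSV = """feature_a,feature_b,label
0.12,1.30,POSITIVE
0.45,0.98,NEGATIVE
0.33,1.10,NEUTRAL
0.27,1.41,POSITIVE
0.51,0.87,NEGATIVE
0.30,1.05,NEUTRAL
"""


def class_distribution(csv_file, target="label"):
    """Count how many rows fall into each class of the target column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[target] for row in reader)


counts = class_distribution(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())

# Print each class with its row count and share of the data.
for name, n in sorted(counts.items()):
    print(f"{name}: {n} rows ({n / total:.0%})")
```

To try this on a real file, `io.StringIO(SAMPLE_CSV)` could be swapped for `open(path, newline="")`, where `path` points to the downloaded CSV. A roughly even split between classes, as in this toy sample, is what makes plain accuracy a reasonable first metric when comparing classifiers.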

Scotland's Environment Web Scotland’s Environment Web is developing a range of data visualisation applications that can be used by a wide variety of people with varying data interests and expertise - including the general public, teachers and students, policy officers and environmental assessment specialists. These applications read from significant amounts of raw data, in different formats and from multiple sources. They present data in an interactive format that helps users to relate, transform, manipulate and analyse it to extract useful information. Live links are maintained to the published data so that automated updates are made on a regular basis. One of the major benefits of these tools is that the interactive graphs, images and data tables shown in the application can be exported as images, CSV files and PDF documents for use in reports and presentations. If you have an idea for a new data visualisation tool, please fill out our request form and email to

IEEE DataPort - IEEE Big Data IEEE DataPort™ is now available for use! Go to to be connected to this valuable one-stop shop data repository serving the growing technical community focused on Big Data! Contact Melissa Handa today at for a coupon code to become a subscriber free of charge! Share, Access and Analyze Big Data with IEEE DataPort™! IEEE realizes that data generation and data analytics are increasingly critical in many aspects of research and industry. What Capabilities Does IEEE DataPort™ Provide? 1. 2. 3. 4. Get involved! Go to to load your first dataset today!

How to build an image classifier with greater than 97% accuracy by Anne Bonner A clear and complete blueprint for success How do you teach a computer to look at an image and correctly identify it as a flower? How do you teach a computer to see an image of a flower and then tell you exactly what species of flower it is when even you don’t know what species it is? Let me show you! This article will take you through the basics of creating an image classifier with PyTorch. What you do from here depends entirely on you and your imagination. I put this article together for anyone out there who’s brand new to all of this and looking for a place to begin. If you want to view the notebook, you can find it here. Because this PyTorch image classifier was built as a final project for a Udacity program, the code draws on code from Udacity which, in turn, draws on the official PyTorch documentation. Information about the flower data set can be found here. Let’s get started! Because I was using Colab, I needed to start by importing PyTorch. *** UPDATE! ! ! For example: !

OpenPrescribing asdfree

Predicting Airline Delays – Jesse Steinweg-Woods, Ph.D. – Data Scientist I don’t know about all of you, but flying doesn’t always go smoothly for me. I have had some horror stories I could tell you about weird delays I have encountered while flying. Well, that’s what this project will attempt to do. To complete this project, we need some data about flights. Similar to the project about faculty salaries, this post will be split into two major parts: exploratory data analysis and feature engineering in R, with regression model implementation in Python. Getting the Data: For this project, the best place to get data about airlines is from the US Department of Transportation, here. I only wished to include features that a user could enter at any time. As someone who has studied the weather for a very long time, trust me when I say the furthest out you can predict the weather at a specific location with any sort of accuracy is about a week. Let’s take a look at what our flightsDB dataframe contains to make sure there weren’t any issues.