
Data Science


How to choose the best charts for your infographic - Venngage. One of the most important steps in creating infographics is choosing the right charts to tell your story. How do you pick the best charts to represent your data in a unique and eye-catching way to successfully deliver your message? What are the techniques you can use to visualize your information so that your data speaks for itself? Here are some tried and true tips from the frontlines: 1. How to Convey a Single Important Number Sometimes, you just want to convey a single data point. How do you tell this story?

Use Large Fonts and Labels There is no need for fancy charts here. Kobe Reaches 30000 points To get a Wow reaction with one number or data point, it’s often better to tell your audience in plain large text. Use large font to convey a single important number or point Use Pictograms or Icon Charts to Complement Percentages Although displaying large text works most of the time, having a visual to drive the point home can really help. 2. Use a Bar or Column Chart for basic comparisons 3. 100 open source Big Data architecture papers for data professionals. | Anil Madan.

Amazon. Top 10 data mining algorithms in plain English. Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper. Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started! Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I’ve incorporated into the post. Update 28-May-2015: Thanks to Dan Steinberg (yes, the CART expert!) for the suggested updates to the CART section which have now been added. What does it do? Wait, what’s a classifier? What’s an example of this? Now: Given these attributes, we want to predict whether the patient will get cancer.
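To make the patient example a little more concrete, here is a minimal sketch of how a decision tree learner can score candidate attributes before splitting on one. It assumes gain ratio (information gain divided by split information), the criterion C4.5 is generally described as using. The patient records, the attribute names and the helper names `entropy` and `gain_ratio` are all hypothetical and only illustrate the idea; this is not a full C4.5 implementation (no pruning, no continuous attributes, no missing values).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting `rows` on `attribute`, divided by split information."""
    base = entropy([row[target] for row in rows])

    # Group the class labels by the value each row has for the attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])

    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = base - remainder

    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: attributes and class label are made up for illustration.
patients = [
    {"smoker": "yes", "family_history": "yes", "cancer": "yes"},
    {"smoker": "yes", "family_history": "no", "cancer": "yes"},
    {"smoker": "no", "family_history": "yes", "cancer": "no"},
    {"smoker": "no", "family_history": "no", "cancer": "no"},
]

scores = {a: gain_ratio(patients, a, "cancer") for a in ("smoker", "family_history")}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy data the `smoker` attribute separates the two classes perfectly, so it would be chosen as the root split; a tree builder would then repeat the same scoring on each resulting subset.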

And here’s the deal: Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes. The bottom line is: Learn From the Industry's Best - Big Data University. Fogs, logs and cogs: The newer, bigger shape of big data in the Internet of Things. Big data is becoming the next best thing to true magic. It is everywhere and, increasingly, nowhere specific. Every node in the known computing universe is becoming a component in a vast, distributed, pervasive big data cloud. As we transition to a world where clouds penetrate every facet of our lives, we need to wrap our heads around the thought that every edge node, no matter how resource-constrained, can be interconnected, intelligent and integral to the performance of the whole. What I’m sketching out is the vision of a world in which the Internet of Things (IoT) increasingly drives the evolution of cloud computing architectures.

In an IoT-centric world, nobody needs to know that your cloud’s processing, storage and other functions have been virtualized to endpoints of every size, configuration and capability. As the IoT cloud evolves in this direction, so will big data. This is the vision of "fog computing." How will big data evolve in the era of IoT-centric fog computing? Fogs Logs. 6 dataset lists curated by data scientists | Mortar Blog | Data Science at Scale. Docs Blog 6 dataset lists curated by data scientists November 21, 2013 Scott Haylon Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar.

We’re saving bookmarks and sharing datasets with our team on a nearly-daily basis. There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists. Below is a collection of six great dataset lists from both famous data scientists and those who aren’t well-known: 1) Pete Skomoroch most recently worked as a Research Scientist at LinkedIn. 2) Hilary Mason is a Data Scientist in Residence at Accel Partners (and one of Mortar’s advisors!).

3) Kevin Chai was most recently working as a research fellow for the Centre of Health Informatics at the University of New South Wales in Sydney, Australia. 4) Jeff Hammerbacher is Co-Founder and Chief Scientist of Cloudera. Tags datasets, data science, Blog. D3.js - Data-Driven Documents. Plotly. 66 job interview questions for data scientists. We are now at 91 questions. We've also added 50 new ones here, and started to provide answers to these questions here. These are mostly open-ended questions, to assess the technical horizontal knowledge of a senior candidate for a rather high level position, e.g. director. What is the biggest data set that you processed, and how did you process it, what were the results? Tell me two success stories about your analytic or computer science projects?

Related articles: Previous digest | Recent jobs | Top Links | Data Science eBook. Free Infographic Maker - Venngage. Visualize your resume in one click. In-depth introduction to machine learning in 15 hours of expert videos.

In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (also known as "machine learning"), largely due to the high quality of both the textbook and the video lectures. And as an R user, it was extremely helpful that they included R code to demonstrate most of the techniques described in the book. (Update: The course will be offered again in January 2016!)

If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover-to-cover to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors' website. P.S. Chapter 1: Introduction (slides, playlist)

Top 10 data mining algorithms in plain English.

How to determine the quality and correctness of classification models? Part 2 - Quantitative quality indicators. Basic quantitative quality indicators In the last part of the tutorial we introduced the basic qualitative model quality indicators. Let us recall them now: Derived quality indicators We will now discuss derived variants of these indicators.

TPR (True Positive Rate) – reflects the classifier’s ability to detect members of the positive class (pathological state)
TNR (True Negative Rate) – reflects the classifier’s ability to detect members of the negative class (normal state)
FPR (False Positive Rate) – reflects the frequency with which the classifier makes a mistake by classifying normal state as pathological
FNR (False Negative Rate) – reflects the frequency with which the classifier makes a mistake by classifying pathological state as normal
SE (sensitivity) – reflects the classifier’s ability to detect members of the positive class (pathological state)
SP (specificity) – reflects the classifier’s ability to detect members of the negative class (normal state)

Example.
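The derived indicators listed above can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions (TPR = TP / (TP + FN), TNR = TN / (TN + FP), FPR = FP / (FP + TN), FNR = FN / (FN + TP)); the function name `derived_indicators` and the example counts are hypothetical and are not taken from the tutorial itself.

```python
def derived_indicators(tp, fp, tn, fn):
    """Return TPR, TNR, FPR, FNR, SE and SP from confusion-matrix counts.

    tp / fn: pathological cases classified correctly / incorrectly
    tn / fp: normal cases classified correctly / incorrectly
    """
    positives = tp + fn  # all truly pathological cases
    negatives = tn + fp  # all truly normal cases
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    tpr = tp / positives
    tnr = tn / negatives
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": fp / negatives,
        "FNR": fn / positives,
        "SE": tpr,  # sensitivity is the same quantity as TPR
        "SP": tnr,  # specificity is the same quantity as TNR
    }


# Hypothetical counts: 50 pathological and 50 normal cases.
result = derived_indicators(tp=40, fp=5, tn=45, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Note that TPR and FNR always sum to 1, as do TNR and FPR, so in practice reporting sensitivity and specificity is enough to recover all four rates.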

60 new resources and articles about data science, IoT, machine learning, R, Python, big data. Probability Cheatsheet. A collection of links for streaming algorithms and data structures. Cinemas NOS. The Open Source Data Science Masters. Information Is Beautiful. CS109 Data Science. Learning from data in order to gain useful predictions and insights. This course introduces methods for five key facets of an investigation: data wrangling, cleaning, and sampling to get a suitable data set; data management to be able to access big data quickly and reliably; exploratory data analysis to generate hypotheses and intuition; prediction based on statistical methods such as regression and classification; and communication of results through visualization, stories, and interpretable summaries.

We will be using Python for all programming assignments and projects. All lectures will be posted here and should be available 24 hours after meeting time. The course is also listed as AC209, STAT121, and E-109. Lectures and Labs Lectures are 2:30-4pm on Tuesdays & Thursdays in Northwest B103 Labs are 10am-12pm on Fridays, Room: Geological Museum 100 Instructors Rafael Irizarry, Biostatistics Verena Kaynig-Fittkau, Computer Science Guest Lecturer Marc Streit Staff.