
Data Science


Tony Chu on Twitter: "Visualizing how a decision tree makes classifications: #d3js #machinelearning #ml." A Visual Introduction to Machine Learning. Finding better boundaries: Let's revisit the 73-m elevation boundary proposed previously to see how we can improve upon our intuition.

A Visual Introduction to Machine Learning

Clearly, this requires a different perspective. By transforming our visualization into a histogram, we can better see how frequently homes appear at each elevation. While the highest home in New York is at 73 m, the majority of them seem to have far lower elevations.

Your first fork

A decision tree uses if-then statements to define patterns in data. For example, if a home's elevation is above some number, then the home is probably in San Francisco. In machine learning, these statements are called forks, and they split the data into two branches based on some value. That value between the branches is called a split point.

Tradeoffs

Picking a split point has tradeoffs. Look at that large slice of green in the left pie chart: those are all the San Francisco homes that are misclassified.

The best split. Text Analytics.
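To make the idea of a fork and its split point concrete, here is a minimal sketch in Python. The home data is hypothetical (it is not the article's dataset), the function names are made up for illustration, and plain classification accuracy is used as the score, which is a simplifying assumption rather than the article's own measure.

```python
# Hypothetical (elevation in metres, city) pairs; NOT data from the article.
homes = [
    (4, "NY"), (10, "NY"), (12, "NY"), (25, "NY"), (73, "NY"),
    (8, "SF"), (30, "SF"), (55, "SF"), (90, "SF"), (150, "SF"),
]


def split_accuracy(homes, split_point):
    """Fraction of homes classified correctly by a single fork:
    elevation > split_point -> "SF", otherwise "NY"."""
    correct = sum(
        1
        for elevation, city in homes
        if ("SF" if elevation > split_point else "NY") == city
    )
    return correct / len(homes)


def best_split(homes):
    """Try the midpoint between each pair of adjacent distinct elevations
    and return the candidate split point with the highest accuracy."""
    elevations = sorted({elevation for elevation, _ in homes})
    candidates = [(a + b) / 2 for a, b in zip(elevations, elevations[1:])]
    return max(candidates, key=lambda s: split_accuracy(homes, s))


# The intuitive 73 m boundary misclassifies every SF home at or below 73 m.
print(split_accuracy(homes, 73))
# Searching the candidates finds a lower boundary that trades a few
# misclassified NY homes for many more correctly classified SF homes.
print(best_split(homes))
```

On this toy data the search prefers a lower boundary than 73 m, which is the tradeoff described above: moving the split point down catches more San Francisco homes at the cost of mislabeling some New York ones.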

Jonathan A. Frye on Twitter: "Practical #DataScience in #Python: Guidebook via @DataScienceCtrl..." Practical Data Science in Python: Guidebook.

Dr. Diego Kuonen on Twitter: "GREAT intro 2 'Statistical #Learning' > Slides & videos: #DataScience #ML #Statistics #BigData." In-depth introduction to machine learning in 15 hours of expert videos. In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR).

In-depth introduction to machine learning in 15 hours of expert videos

I found it to be an excellent course in statistical learning (also known as "machine learning"), largely due to the high quality of both the textbook and the video lectures. And as an R user, I found it extremely helpful that they included R code to demonstrate most of the techniques described in the book. (Update: The course will be offered again in January 2016!) If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover to cover to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors' website. P.S.

How to: Parallel Programming in R and Python [Video]. IoT Python app with a Raspberry Pi and Bluemix. Dr. Diego Kuonen on Twitter: "George E. P. Box in 1976 (!) on "#Science & #Statistics" > #DataScience #BigData."

Boxonmaths.pdf. My advice on what you need to do to become a data scientist... Kirk Borne on Twitter: "The Building Blocks + Precursors of #DataScience: #abdsc #BigData #Statistics #MachineLearning." Building blocks of data science. Descriptive Predictive Prescriptive Analytics. The goal of Data Analytics (big and small) is to get actionable insights resulting in smarter decisions and better business outcomes.

Descriptive Predictive Prescriptive Analytics

How you architect business technologies and design data analytics processes to get valuable, actionable insights varies. It is critical to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets.

Data Mining and Airline Safety. [Source: Handbook of Statistical Analysis and Data Mining; Nisbet, Elder, Miner, p. 378] Since 1980, however, the decline in fatalities has somewhat stabilized, which probably indicates that new thinking and new safety approaches are needed to push the rate of fatalities down further.

Data Mining and Airline Safety

One such approach could be the use of data mining to determine the causes of fatalities so that preventative action may be taken. In this post, we will use publicly available data on airline safety to identify the main causes of accidents and then determine which are the main predictors of accidents. Needless to say, for the purposes of this post, we will keep our analysis simple, merely to show that data mining is a useful tool for conducting this analysis.

Data Understanding

We download the data sets from the website of the Federal Aviation Administration (FAA), which, among other reports, contains a series of reports called the Service Difficulty Reports (SDRs).

Practical illustration of Map-Reduce (Hadoop-style), on real data. Here I will discuss a general framework to process web traffic data.

Practical illustration of Map-Reduce (Hadoop-style), on real data

The concept of Map-Reduce will be naturally introduced. Let's say you want to design a system to score Internet clicks, to measure the chance that a click converts, or the chance that it is fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, with a conversion rate above 10%; some conversions are hard, for instance a credit card purchase, with a conversion rate below 1%). For now, we just ignore the conversion data and focus on the low-hanging fruit: click data. We work with complete click data collected over a 7-day time period. The first step is to extract the relevant fields for this quick analysis (a few days of work).

IP address
Day
UA (user agent) ID - so we created a look-up table for UAs
Partner ID
Affiliate ID

Here's the workaround:
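The excerpt stops before showing the author's own workaround, so the following is not that method. It is only a minimal Python sketch, under stated assumptions, of how the extracted fields could feed a Map-Reduce-style count: the click records, the field order, the function names, and the choice of (IP address, day, UA ID) as the key are all hypothetical.

```python
from collections import defaultdict

# Hypothetical click records, NOT real data:
# (ip_address, day, user_agent_string, partner_id, affiliate_id)
clicks = [
    ("1.2.3.4", "day1", "Mozilla/5.0", "p1", "a1"),
    ("1.2.3.4", "day1", "Mozilla/5.0", "p1", "a2"),
    ("5.6.7.8", "day1", "curl/7.0", "p2", "a1"),
    ("1.2.3.4", "day2", "curl/7.0", "p1", "a1"),
]


def build_ua_lookup(clicks):
    """Look-up table mapping each distinct user-agent string to a small
    integer ID, assigned in order of first appearance."""
    lookup = {}
    for _, _, ua, _, _ in clicks:
        lookup.setdefault(ua, len(lookup))
    return lookup


def map_phase(clicks, ua_lookup):
    """Map step: emit one (key, 1) pair per click.
    The key (ip, day, ua_id) is an assumption made for illustration."""
    for ip, day, ua, _partner, _affiliate in clicks:
        yield (ip, day, ua_lookup[ua]), 1


def reduce_phase(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


ua_lookup = build_ua_lookup(clicks)
click_counts = reduce_phase(map_phase(clicks, ua_lookup))
for key, count in sorted(click_counts.items()):
    print(key, count)
```

Everything here runs in one process; in a Hadoop-style setting the map and reduce steps would presumably run over separate chunks of the click log and be merged afterwards, which works because summing counts per key is associative.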