background preloader

Cross Validated

Cross Validated

http://stats.stackexchange.com/

Related:  machine-learning-courseStatsData analysisData Science

Machine Learning for Developers by Mike de Waard Most developers these days have heard of machine learning, but when trying to find an 'easy' way into this technique, most people find themselves getting scared off by the abstractness of the concept of Machine Learning and terms as regression, unsupervised learning, Probability Density Function and many other definitions. If one switches to books there are books such as An Introduction to Statistical Learning with Applications in R and Machine Learning for Hackers who use programming language R for their examples. However R is not really a programming language in which one writes programs for everyday use such as is done with for example Java, C#, Scala etc. This is why in this blog machine learning will be introduced using Smile, a machine learning library that can be used both in Java and Scala.

Log Transformations for Skewed and Wide Distributions This is a guest article by Nina Zumel and John Mount, authors of the new book Practical Data Science with R . For readers of this blog, there is a 50% discount off the “Practical Data Science with R” book, simply by using the code pdswrblo when reaching checkout (until the 30th this month). Here is the post: Normalizing data by mean and standard deviation is most meaningful when the data distribution is roughly symmetric. In this article, based on chapter 4 of Practical Data Science with R , the authors show you a transformation that can make some distributions more symmetric. The need for data transformation can depend on the modeling method that you plan to use.

Zero Intelligence Agents — Drew Conway This happens to be one of those rare instances where the benefit of hindsight does not make me regret something said flippantly on a panel. I deeply believe that in order to truly change the world we cannot simply "throw analytics at the problem." To that end, the medical and health industries are perhaps the most primed to be disrupted by data and analytics. To be successful, however, a deep respect for both the methodological and clinical contexts of the data are required. It is incredibly exciting to be at an organization that is both working within the current framework of health care and data to create new insight for people, but also pushing the envelope with respect to individuals' relationships with their own health. The challenges are technical, sociological, and political; but the potential for innovation that exists in this space comes along very rarely.

R Programming Welcome to the R programming Wikibook This book is designed to be a practical guide to the R programming language[1]. R is free software designed for statistical computing. There is already great documentation for the standard R packages on the Comprehensive R Archive Network (CRAN)[2] and many resources in specialized books, forums such as Stackoverflow[3] and personal blogs[4], but all of these resources are scattered and therefore difficult to find and to compare. The aim of this Wikibook is to be the place where anyone can share his or her knowledge and tricks on R. It is supposed to be organized by task but not by discipline[5].

R Programming Welcome to the 'R' programming Wikibook This book is designed to be a practical guide to the R programming language[1]. R is free software designed for statistical computing. There is already great documentation for the standard R packages on the Comprehensive R Archive Network (CRAN)[2] and many resources in specialized books, forums such as Stackoverflow[3] and personal blogs[4], but all of these resources are scattered and therefore difficult to find and to compare. Machine Learning at Quora - Engineering at Quora - Quora Adapted from my original answer to How does Quora use machine learning in 2015? At Quora we have been using machine learning approaches for some time. We are constantly coming up with new approaches and making big improvements to the existing ones. It is important to note that all these improvements are first optimized and tested offline by using many different kinds of offline metrics but are always finally tested online through A/B tests.

Absolute Deviation Around the Median Median Absolute Deviation (MAD) or Absolute Deviation Around the Median as stated in the title, is a robust measure of central tendency. Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions. Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. This robustness is well illustrated by the median’s breakdown point Donoho & Huber, 1983. The interquartile range is also resistant to the influence of outliers, although the mean and median absolute deviation are better in that they can be converted into values that approximate the standard deviation. Essentially the breakdown point for a parameter (median, mean, variance, etc.) is the proportion or number of arbitrarily small or large extreme values that must be introduced into a sample to cause the estimator to yield an arbitrarily bad result.

The key word in “Data Science” is not Data, it is Science One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were "big data" and a question from the audience was: "Well, that isn't really big data is it, it is only X Gigabytes". While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data.

Welcome to a Little Book of R for Bioinformatics! — Bioinformatics 0.1 documentation By Avril Coghlan, Wellcome Trust Sanger Institute, Cambridge, U.K. Email: alc@sanger.ac.uk This is a simple introduction to bioinformatics, with a focus on genome analysis, using the R statistics software.

Related:  R Programming