background preloader

Data analysis

Facebook Twitter

Forecasting: principles and practice. AMSC689: Research Interaction Team in Penalized Regression (Fall 2013) Description Major advances in technology for genomic studies are bringing the prospect of personalized and individualized medicine closer to reality.

AMSC689: Research Interaction Team in Penalized Regression (Fall 2013)

Many of these advances are predicated on the ability to generate data at an unprecedented rate, posing a significant need for computational data analysis that is clinically and biologically useful and robust. This reading course will concentrate on the fundamental statistical methods required to meet this need. The goal is to be familiar with the key articles and concepts in the field of penalized regression, as well as have a sense of the unanswered questions and current research directions. No prior knowledge of biology is required. Introduction to Statistical Learning. An Introduction to Statistical Learning with Applications in R Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani This book provides an introduction to statistical learning methods.

Introduction to Statistical Learning

Random forests - classification description. Contents Introduction Overview Features of random forests Remarks How Random Forests work The oob error estimate Variable importance Gini importance Interactions Proximities Scaling Prototypes Missing values for the training set Missing values for the test set Mislabeled cases Outliers Unsupervised learning Balancing prediction error Detecting novelties A case study - microarray data Classification mode Variable importance Using important variables Variable interactions Scaling the data Prototypes Outliers A case study - dna data Missing values in the training set Missing values in the test set Mislabeled cases Case Studies for unsupervised learning Clustering microarray data Clustering dna data Clustering glass data Clustering spectral data References Introduction This section gives a brief overview of random forests and some comments about the features of the method.

Random forests - classification description

Model Training and Tuning. The caret package has several functions that attempt to streamline the model building and evaluation process.

Model Training and Tuning

The train function can be used to evaluate, using resampling, the effect of model tuning parameters on performance choose the "optimal" model across these parameters estimate model performance from a training set First, a specific model must be chosen. Currently, 217 are available using caret; see train Model List or train Models By Tag for details.

The caret Package. How do I learn Machine Learning? Machine Learning Resources. Machine Learning. Statistics 36-350: Data Mining (Fall 2009) Cosma Shalizi Fall 2009 Important update, December 2011 If you are looking for the latest version of this class, it is 36-462, taught by Prof.

Statistics 36-350: Data Mining (Fall 2009)

Tibshirani in the spring of 2012. 36-350 is now the course number for Introduction to Statistical Computing. Data mining is the art of extracting useful patterns from large bodies of data; finding seams of actionable knowledge in the raw ore of information. The rapid growth of computerized data, and the computer power available to analyze it, creates great opportunities for data mining in business, medicine, science, government, etc. OpenIntro. R - Generalized linear Models. 5 Generalized Linear Models Generalized linear models are just as easy to fit in R as ordinary linear model.

R - Generalized linear Models

In fact, they require only an additional parameter to specify the variance and link functions. 5.1 Variance and Link Families The basic tool for fitting generalized linear models is the glm function, which has the folllowing general structure: GitHub Flavored Markdown - User Documentation. GitHub uses "GitHub Flavored Markdown," or GFM, across the site--in issues, comments, and pull requests.

GitHub Flavored Markdown - User Documentation

It differs from standard Markdown (SM) in a few significant ways, and adds some additional functionality. If you're not already familiar with Markdown, take a look at Markdown Basics. If you'd like to know more about features that are available in issues, comments, and pull request descriptions, such as task lists, read Writing on GitHub. Differences from traditional Markdown Multiple underscores in words. Markdown Basics. Getting the Gist of Markdown’s Formatting Syntax This page offers a brief overview of what it’s like to use Markdown.

Markdown Basics

The syntax page provides complete, detailed documentation for every feature, but Markdown should be very easy to pick up simply by looking at a few examples of it in action. The examples on this page are written in a before/after style, showing example syntax and the HTML output produced by Markdown. It’s also helpful to simply try Markdown out; the Dingus is a web application that allows you type your own Markdown-formatted text and translate it to XHTML. Note: This document is itself written using Markdown; you can see the source for it by adding ‘.text’ to the URL. Data Science Specialization. Managing a statistical analysis project – guidelines and best practices.

In the past two years, a growing community of R users (and statisticians in general) have been participating in two major Question-and-Answer websites:

Managing a statistical analysis project – guidelines and best practices

The Duke Saga Starter Set. Courses/05_ReproducibleResearch at master · rdpeng/courses. Determining the number of clusters in a data set. Statistics for Genomcs: Distances and Clustering. Cooperation between Referees and Authors Increases Peer Review Accuracy. Peer review is fundamentally a cooperative process between scientists in a community who agree to review each other's work in an unbiased fashion.

Cooperation between Referees and Authors Increases Peer Review Accuracy

Peer review is the foundation for decisions concerning publication in journals, awarding of grants, and academic promotion. Here we perform a laboratory study of open and closed peer review based on an online game. Json-mapping.pdf. XML.pdf. Tour.pdf. ShortIntro.pdf. XLConnect.pdf. Genolini-RBonnesPratiques.pdf. Statistics and the Science Club. One of my favorite movies is Woody Allen’s Annie Hall. If you’re my age and you haven’t seen it, I usually tell people it’s like When Harry Met Sally, except really good. The movie opens with Woody Allen’s character Alvy Singer explaining that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud). Last week I posted a link to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:

Dan Meyer: Math class needs a makeover. The key word in “Data Science” is not Data, it is Science. One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. Zero Intelligence Agents — Drew Conway. This happens to be one of those rare instances where the benefit of hindsight does not make me regret something said flippantly on a panel. I deeply believe that in order to truly change the world we cannot simply "throw analytics at the problem. " To that end, the medical and health industries are perhaps the most primed to be disrupted by data and analytics.

To be successful, however, a deep respect for both the methodological and clinical contexts of the data are required. It is incredibly exciting to be at an organization that is both working within the current framework of health care and data to create new insight for people, but also pushing the envelope with respect to individuals' relationships with their own health. The challenges are technical, sociological, and political; but the potential for innovation that exists in this space comes along very rarely. I feel lucky to have an opportunity to move into the health data space now. In Head-Hunting, Big Data May Not Be Such a Big Deal.

Cross Validated. CRAN Task Views. Hilarymason.com. Junk Charts. Data Visualization, Infographics, and Statistics. Statistics Blogs @ StatsBlogs.com. Normal Deviate. Statistical Modeling, Causal Inference, and Social Science « Statistical Modeling, Causal Inference, and Social Science.

Statistics and the Science Club. ADAfaEPoV. Simulation.dvi - simulation_handout.pdf. Untitled - MPadilla_07.pdf. The importance of simulating the extremes. Simulation is commonly used by statisticians/data analysts to: (1) estimate variability/improve predictors , (2) to evaluate the space of potential outcomes , and (3) to evaluate the properties of new algorithms or procedures.

Over the last couple of days, discussions of simulation have popped up in a couple of different places. First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions. I mentioned in passing, that I thought it was a good idea to simulate some cases where our method will definitely break down. I also saw this post by John Cook about simple/complex models. Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations - Appendix B: Introduction to Multiple Testing. This appendix introduces the hypothesis testing framework for this report, the multiple testing problem, statistical methods to adjust for multiplicity, and some concerns that have been raised about these solutions. The goal is to provide an intuitive, nontechnical discussion of key issues related to this complex topic to help education researchers apply the guidelines presented in the report. Statistical significance for genomewide studies.

Bayesian Model Averaging Home Page. Bayesian Model Averaging Home Page Bayesian Model Averaging is a technique designed to help account for the uncertainty inherent in the model selection process, something which traditional statistical analysis often neglects. By averaging over many different competing models, BMA incorporates model uncertainty into conclusions about parameters and prediction.

BMA has been applied successfully to many statistical model classes including linear regression, generalized linear models, Cox regression models, and discrete graphical models, in all cases improving predictive performance. AdaBoost. Bootstrap aggregating. Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.

Random forest. Random forests - classification description. Contents. Bootstrapping. Nonparametric Bootstrapping. Getting started with the `boot' package in R for bootstrap inference. The package boot has elegant and powerful support for bootstrapping. Lecture12.dvi - lecture12.pdf. ADAfaEPoV. Section-09.pdf. Section-06.pdf. Smoothing.pdf. Lecture-22. Untitled. Trees.pdf. Package boot. CvTools.pdf. Microsoft Word - CrossValidation1_JDS_May2011.docx - CrossValidation1_JDS_May2011.pdf. Practical machine learning: methods and algorithmics. Cohen's kappa. Peter Norvig - The Unreasonable Effectiveness of Data. Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition. Lars. Regression Models for Count Data in R - countreg. Log Linear Analysis. Main - oiStat2_08.pdf. Lecture 24 - lecture24.pdf. Lecture 23 - lecture23.pdf. R - Generalized linear Models. Data Visualization, Infographics, and Statistics.

Erreur de chargement de la page. R news & tutorials from the web. 14-reg-exp. Lecture 3. Tidy-data-pres.pdf. Gapminder: Unveiling the beauty of statistics for a fact based world view. Données publiques / Open Data. Tidy data - tidy-data.pdf. ProjectTemplate. The Duke Saga Starter Set. Twotorials by anthony damico. R code style guide. OpenIntro. ADAfaEPoV. Short-refcard. Cross Validated. Stack Overflow. Stack Overflow. Google's R Style Guide.