
Multiple linear regression


R Tutorials--Multiple Regression

Preliminaries

Model Formulae. If you haven't yet read the tutorial on Model Formulae, now would be a good time!

Statistical Modeling. There is not space in this tutorial, and probably not on this server, to cover the complex issue of statistical modeling. For an excellent discussion, I refer you to Chapter 9 of Crawley (2007). Here I will restrict myself to a discussion of linear modeling. However, transforming variables to make them linear (see Simple Nonlinear Correlation and Regression) is straightforward, and a model including interaction terms is as easy to create as changing plus signs to asterisks. R also provides functions for other kinds of models:

glm( ) for generalized linear models (covered in another tutorial)
gam( ) for generalized additive models
lme( ) and lmer( ) for linear mixed-effects models
nls( ) and nlme( ) for nonlinear models

...and I'm sure there are others I'm leaving out. My familiarity with these functions is "less than thorough" (!)
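For instance, here is a minimal sketch of that plus-to-asterisk change (the response y, predictors x1 and x2, and data frame mydata are placeholder names):

# additive model: main effects only
fit.add <- lm(y ~ x1 + x2, data=mydata)

# change the plus sign to an asterisk to add the interaction;
# y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2
fit.int <- lm(y ~ x1 * x2, data=mydata)
summary(fit.int) # coefficients now include an x1:x2 row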

Warning: an opinion follows!

Preliminary Examination of the Data

Easy...

PROPHET StatGuide: Possible alternatives if your data violate multiple linear regression assumptions.

Different linear model: Y may actually be best modeled by a linear function that includes other variables in addition to the current set of X variables, a subset of the current set of X variables, or a subset of the current set of X variables plus one or more new X variables. If a graph of the residuals against the prospective X variable suggests a linear trend, then adding the new X variable to the model may provide a better model.
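A minimal sketch of that residual check, assuming a fitted model fit and a candidate predictor newx that is not yet in the model (all names are placeholders):

fit <- lm(y ~ x1 + x2, data=mydata)
plot(mydata$newx, residuals(fit),
     xlab="prospective X variable", ylab="residuals")
abline(h=0, lty=2) # reference line at zero
# a visible linear trend suggests adding newx to the model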

A "new" X variable might be derived from one or more X variables already in the equation, such as using the square of X1 along with X1 to handle curvature in X1, or adding X1*X2 as a new variable to handle interaction between X1 and X2. In a situation of multicollinearity, a more useful model may actually involve removing one or more X variables, perhaps also adding one or more new ones.

Nonlinear model: Y may instead be best fit by a nonlinear function of the X variables, such as an exponential model.

Transformations: For a power transformation with p = -0.5 (reciprocal square root), 0 (logarithm), or 0.5 (square root), the data values must all be positive.
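In R, such derived terms and power transformations might look like this (all names are placeholders):

fit.quad <- lm(y ~ x1 + I(x1^2), data=mydata) # square of x1 handles curvature
fit.int <- lm(y ~ x1 + x2 + x1:x2, data=mydata) # x1:x2 handles interaction
fit.sqrt <- lm(sqrt(y) ~ x1 + x2, data=mydata) # p = 0.5; y must be positive
fit.log <- lm(log(y) ~ x1 + x2, data=mydata) # p = 0; y must be positive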

Multiple Regression

General Purpose. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.

For example, a real estate agent might record, for each listing, the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of the appeal of the house. Once this information has been compiled for various houses, it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). You may also detect "outliers," that is, houses that should really sell for more, given their location and characteristics. In another setting, a fitted regression equation might look like: Salary = .5*Resp + .8*No_Super.
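A sketch of the house-price example above in R (the data frame houses and its column names are invented for illustration):

fit <- lm(price ~ sqft + bedrooms + income + appeal, data=houses)
summary(fit) # which measures predict selling price?
# listings whose actual price falls far below the fitted price
# are candidate "outliers" that should really sell for more
head(sort(residuals(fit))) # most negative residuals first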

PROPHET StatGuide: Do your data violate multiple linear regression assumptions? Often, the impact of an assumption violation on the multiple linear regression result depends on the extent of the violation (such as how inconstant the variance of Y is, or how skewed the Y population distribution is). Some small violations may have little practical effect on the analysis, while other violations may render the multiple linear regression result uselessly incorrect or uninterpretable.

Implicit independent variables (covariates): Apparent lack of independence in the fitted Y values may be caused by the existence of an implicit X variable in the data, an X variable that was not explicitly used in the linear model. In this case, the best model may still be linear, but may not include all the original X variables. If there is a linear trend in the plot of the regression residuals against the fitted values, then an implicit X variable may be the cause. Another possible cause of apparent dependence between the Y observations is the presence of an implicit block effect.

Multiple Regression

R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity.

Fitting the Model

# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results

# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics

Diagnostic Plots

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)

For a more comprehensive evaluation of model fit see regression diagnostics.

Comparing Models

You can compare nested models with the anova( ) function (see the sketch at the end of this section).

Cross Validation

You can assess R2 shrinkage via K-fold cross-validation (also sketched below).

Variable Selection

Multicollinearity
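Picking up the two items above, a minimal sketch of a nested-model comparison and a hand-rolled K-fold cross-validation in base R (model and data names carried over from the earlier example; package-based alternatives exist but are not shown):

# nested model comparison: does x3 improve the fit?
fit1 <- lm(y ~ x1 + x2 + x3, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit2, fit1) # F-test on the extra term

# K-fold cross-validation, base R only
k <- 5
folds <- sample(rep(1:k, length.out=nrow(mydata))) # random fold labels
press <- 0 # sum of squared prediction errors
for (i in 1:k) {
  m <- lm(y ~ x1 + x2 + x3, data=mydata[folds != i, ])
  test <- mydata[folds == i, ]
  press <- press + sum((test$y - predict(m, test))^2)
}
r2.cv <- 1 - press / sum((mydata$y - mean(mydata$y))^2)
r2.cv # typically lower than the in-sample R2: the "shrinkage"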

Multiple regression

When to use it. You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent (Y) variable. The rest of the variables are the independent (X) variables. The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables.

Multiple regression for prediction. One use of multiple regression is prediction or estimation of an unknown Y value corresponding to a set of X values.

Multiple regression for understanding causes. A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable.

Null hypothesis

How it works. The basic idea is that an equation is found, like this:

Yexp = a + b1*X1 + b2*X2 + b3*X3 + ...

How well the equation fits the data is expressed by R2, the "coefficient of multiple determination."

Important warning

Example
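As an illustration only (all names invented), here is how the fitted equation, R2, and a prediction for a new set of X values look in R:

fit <- lm(y ~ x1 + x2 + x3, data=mydata)
coef(fit) # a (intercept) and b1, b2, b3
summary(fit)$r.squared # R2, the coefficient of multiple determination

# predict an unknown Y for a new set of X values
newX <- data.frame(x1=10, x2=3.5, x3=7)
predict(fit, newdata=newX) # Yexp for those X values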

Correlation and linear regression

Introduction. I find the descriptions of correlation and regression in most textbooks to be unnecessarily confusing. Some statistics textbooks have correlation and linear regression in separate chapters, and make it seem as if it is important to pick one technique or the other, based on subtle differences in the design and assumptions of the experiment. I think this overemphasizes the differences between them. Other books muddle correlation and regression together, leaving the reader puzzled about what the difference is.

My understanding of the two techniques, as they are practiced, is that they primarily differ in goals. When you have two measurement variables in biology, you'll usually want to do both correlation and regression: you'll want the P-value of the hypothesis test, the r2 that describes the strength of the relationship, and the regression line that illustrates the relationship. Here I'll treat correlation and linear regression as different aspects of a single analysis.
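A sketch of that combined analysis for two measurement variables (the vectors x and y are placeholders):

cor.test(x, y) # P-value for the hypothesis test
fit <- lm(y ~ x) # regression line
summary(fit)$r.squared # r2, the strength of the relationship
plot(x, y)
abline(fit) # draw the regression line on the scatterplot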