Multiple regression

Gabriel Singer

21.01.2025

Multiple linear regression model

Multiple continuous (metric) predictors are used to predict one continuous (metric) response variable.

Model formulation: linear combination of \(p\) predictors: \[ \mathbb{E}(y) =\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p \]

Why do an MLR? There are two fundamentally different motivations: explanation (quantifying and comparing the effects of individual predictors) and prediction (forecasting new, unobserved outcomes).

Multiple linear regression model

Using matrix notation:

\[ \hat{y} = \mathbf{X}\mathbf{B} \]

\(\mathbf{B}\) is the parameter vector

\(\mathbf{X}\) is the design matrix

\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,p} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,p} \\ \end{bmatrix} \]


\[ \begin{aligned} \mathbb{E}(y) = \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]

The equation is a linear system and can be solved with linear algebra by ordinary least squares (OLS), minimizing the sum of squared residuals:

\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (y - \mathbf{X}\mathbf{B} \right)^2 \]

The solution is \(\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'y\).
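As a minimal sketch (using simulated data, since no dataset has been introduced at this point), the OLS solution can be computed by hand and compared with lm():

set.seed(1)
n = 50
X = cbind(1, x1 = rnorm(n), x2 = rnorm(n))  # design matrix with intercept column
y = X %*% c(2, 1.5, -0.8) + rnorm(n)        # simulated response
B_hat = solve(t(X) %*% X) %*% t(X) %*% y    # B = (X'X)^-1 X'y
cbind(B_hat, coef(lm(y ~ X[, -1])))         # identical estimates from lm()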

Multiple linear regression: Hypotheses

  1. \(\mathbf{H_{0,regression}}\): All partial regression coefficients are zero.
    • The model (i.e., the linear system \(\mathbf{XB}\)) does not explain any of the variation in \(y\).
    • Test with an ANOVA (F-test) against a null model with no predictors, as in simple linear regression.
  2. \(\mathbf{H_{0,coef_j}}\): An individual partial regression coefficient, the slope of the relationship between \(y\) and \(x_j\), is zero.
    • With \(p\) predictors, there are \(p\) such hypotheses.
    • Test with a \(t\)-statistic.
    • Build a confidence limit for the slope of \(x_j\). (Both tests are sketched in R after this list.)
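A minimal sketch of both tests in R, using simulated data; dat, x1, x2 and y are placeholder names:

set.seed(2)
dat = data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y = 1 + 0.5 * dat$x1 + rnorm(30)
null_mod = lm(y ~ 1, data = dat)   # null model: intercept only
full_mod = lm(y ~ x1 + x2, data = dat)
anova(null_mod, full_mod)          # H0,regression: overall F-test
summary(full_mod)                  # H0,coef_j: t-test per coefficient
confint(full_mod)                  # confidence limits for each slope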


Comparing effects using slopes

Do not compare p-values to determine which predictor is “better”

Rather, compare standardized effects

Either standardize all predictors (especially if \(s_{x_1} \gg s_{x_2}\)) or standardize the coefficients.

\[ {b_j}^*=b_j\frac{s_{X_j}}{s_Y} \]

Higher \({b_j}^*\) means stronger influence of \(x_j\). Note that in software output \({b_j}^*\) is often referred to as \(\beta_j\).

To express uncertainty for a regression slope, use confidence intervals. For the \(t\)-statistic use \(d.f.=n-(p+1)\), where \(n\) is the sample size and \(p\) is the number of predictors in the model.
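A sketch of both steps for a generic fitted model mod with predictor x1 and response y in a data frame dat (all names are placeholders):

b_star = coef(mod)["x1"] * sd(dat$x1) / sd(dat$y)  # standardized slope b*
confint(mod, "x1", level = 0.95)                   # CI, uses t with df = n-(p+1)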

Multiple linear regression: Goodness of fit

As in simple regression, goodness of fit is measured by \(R^2 = 1 - SS_{res}/SS_{tot}\). BUT: adding predictors (even random numbers) will always increase \(R^2\) (by making the model more flexible).

One solution is Adjusted \(R^2\): it penalises \(R^2\) for additional model complexity.

\[ {R^2}_{adj}=1-\frac{SS_{res}}{SS_{tot}}\times\frac{n-1}{n-p-1} \]
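A sketch verifying the formula against R's built-in value, for any fitted lm object mod:

n = nobs(mod); p = length(coef(mod)) - 1      # sample size, number of predictors
ss_res = sum(resid(mod)^2)
y_obs = model.response(model.frame(mod))
ss_tot = sum((y_obs - mean(y_obs))^2)
1 - ss_res / ss_tot * (n - 1) / (n - p - 1)   # matches summary(mod)$adj.r.squared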

Comparing effects using \(R^2\) partitioning

This may be useful in particular cases. Example: \[ Flux=k \cdot \Delta C \] \[ \log Flux=\log k + \log \Delta C \] Fitting a model to a dataset here gives \(R^2=1\), with two fractions partitioned for \(\log k\) and \(\log \Delta C\).

Assumptions for MLR

  1. Linearity: \(\mathbb{E}(y)\) is a linear combination of the predictors.
  2. Independence of the observations.
  3. Homoscedasticity: constant variance of the residuals.
  4. Normality of the residuals.
  5. No (strong) multicollinearity among the predictors (see below).

Dealing with multicollinearity

Why is this a problem? Correlated predictors carry overlapping information: the model cannot uniquely attribute variation in \(y\) to any one of them, so coefficient estimates become unstable and their standard errors inflate (see below).

Variance inflation factors

Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:

\[ x_i = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + \epsilon \quad (\text{excluding } x_i) \]

Large \(R^2_i\) would argue for redundancy of \(x_i\) (its information is already contained in a linear combination of all other \(x\)-variables).

\(VIF_i\) is a transformation of \(R^2_i\):

\[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \]

The VIF (hence its name: variance inflation factor) tells you by how much the SE of a regression coefficient increases due to the inclusion of additional, correlated \(x\)-variables:

\[ s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}} \]

\(s_{b,\rho=0}\) is the standard error of a regression coefficient, assuming that all predictors are uncorrelated.
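A sketch of the computation "by hand" for one predictor (the names x1, x2, x3 in a data frame dat are placeholders):

r2_1 = summary(lm(x1 ~ x2 + x3, data = dat))$r.squared  # regress x1 on the others
1 / (1 - r2_1)                                          # VIF for x1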

VIF in R

# install.packages("car") # install the package, only need to do this once!
library(car)
## Loading required package: carData
library(palmerpenguins)  # provides the `penguins` data used below

full_mod = lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm + 
                body_mass_g + sex, data = penguins)
vif(full_mod)

Procedure: if any VIF > 10, drop the variable with the largest VIF, repeat
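A sketch of this rule as a loop, assuming all predictors are continuous (so vif() returns a named vector) and that at least two predictors remain at every step:

while (max(v <- vif(mod)) > 10) {
  worst = names(which.max(v))                          # predictor with largest VIF
  mod = update(mod, as.formula(paste(". ~ . -", worst)))  # drop it and refit
}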

Choosing predictors (models)

Simplest approach: fit a full model with all predictors, then drop anything that is non-significant, either all at once or stepwise ;-)

Alternatively: compare candidate models with an information criterion (next slides).

Caveats: stepwise procedures are order-dependent, and p-values computed after variable selection are no longer strictly valid.

Choosing predictors (models) based on information theory

The best model in a set is the one that lets you accurately predict the value of a new unobserved outcome \(y_*\) based on the vector of inputs/predictors \(\mathbf{X}_*\).

We want to minimize the expected prediction error

\[ \mathbb{E}(\epsilon_*) = | y_* - \hat{y}_* | = | y_* - \mathbf{X}_*\mathbf{B} | \]

We usually cannot measure it directly; instead, we approximate it with Akaike's Information Criterion (AIC):

\[ AIC = n \cdot \ln SS_{err}+ 2(p+1)-n \cdot \ln n \]

Like adjusted \(R^2\), it penalises models for complexity.

Lower AIC is better. The model with lowest AIC is called the most parsimonious model. Models that differ by <2 units in AIC are considered “equivalent” and usually reported as well.
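Note that the formula above differs from R's AIC() only by an additive constant (R's version includes the Gaussian likelihood's \(2\pi\) term and counts the residual variance as a parameter), so model comparisons agree. A sketch for a fitted lm object mod:

n = nobs(mod); p = length(coef(mod)) - 1
aic_slide = n * log(sum(resid(mod)^2)) + 2 * (p + 1) - n * log(n)
AIC(mod) - aic_slide   # constant n*(log(2*pi)+1) + 2, same for every model on the same data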

Choosing predictors (models) based on information theory

Question: What measurements can be used to predict bill length?

full_mod = lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm + 
                body_mass_g + sex, data = penguins)
summary(full_mod)
## 
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6273 -1.4293 -0.0548  1.1696  7.2860 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.6106378  8.6416123   0.071  0.94379   
## bill_depth_mm     0.6735216  0.3385003   1.990  0.04902 * 
## flipper_length_mm 0.1331484  0.0466317   2.855  0.00511 **
## body_mass_g       0.0015048  0.0007275   2.069  0.04085 * 
## sexmale           0.5253570  0.7310479   0.719  0.47384   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.136 on 114 degrees of freedom
## Multiple R-squared:  0.543,  Adjusted R-squared:  0.5269 
## F-statistic: 33.86 on 4 and 114 DF,  p-value: < 2.2e-16

Choosing predictors (models) based on information theory

Question: What measurements can be used to predict bill length?

Does sex matter? Or can a sex effect be more simply explained by size?

simple_mod = lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm + 
                    body_mass_g, data = penguins)
AIC(full_mod) - AIC(simple_mod)
## [1] 1.46213

The full model has the higher AIC, but the difference is below 2 units, so the two models are "equivalent"; by parsimony, we prefer the simpler model without sex.

Model building

Such models are prone to overfitting: irrelevant predictors that correlate with the outcome by chance produce a model that fits the dataset well but performs poorly when challenged with new data (low transferability).

Use validation to ensure a model is fit for prediction: check its predictions against data that were not used for fitting (a sketch follows).
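A minimal k-fold cross-validation sketch, assuming a data frame dat with response y (all names are placeholders):

k = 5
folds = sample(rep(1:k, length.out = nrow(dat)))     # random fold assignment
cv_rmse = sapply(1:k, function(i) {
  fit = lm(y ~ ., data = dat[folds != i, ])          # fit on training folds
  pred = predict(fit, newdata = dat[folds == i, ])   # predict held-out fold
  sqrt(mean((dat$y[folds == i] - pred)^2))
})
mean(cv_rmse)                                        # out-of-sample prediction error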

Pitfalls: Model flexibility

# small example dataset: body mass (kg) and brain volume (cc) for 7 specimens
hom = data.frame(
    body_mass_kg = c(34.5, 35.5, 37, 41.5, 55.4, 53.4, 60.9), 
    brain_volume_cc = c(652.4, 464.5, 448.8, 549.3, 819.9, 1540.4, 963.2)
)
mod1 = lm(brain_volume_cc ~ body_mass_kg, data = hom)

Pitfalls: Model flexibility

mod1 = lm(brain_volume_cc ~ body_mass_kg, data = hom)
mod2 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2), data = hom)

Pitfalls: Model flexibility

mod1 = lm(brain_volume_cc ~ body_mass_kg, data = hom)
mod2 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2), data = hom)
mod3 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3), data = hom)
mod4 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4), data = hom)
mod5 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4) + I(body_mass_kg^5), data = hom)
mod6 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4) + I(body_mass_kg^5) + I(body_mass_kg^6), data = hom)
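Extracting \(R^2\) from all six fits shows the flexibility problem: with only 7 data points, the degree-6 polynomial (7 parameters) fits the sample perfectly, yet such a model predicts poorly outside the observed range:

sapply(list(mod1, mod2, mod3, mod4, mod5, mod6),
       function(m) summary(m)$r.squared)   # climbs towards 1 with model flexibility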

The curse of dimensionality


Assume predictors \(X_1\) and \(X_2\) and limited sampling effort of n=16:

  1. When studying only one predictor, we can cover its entire range of interest well.
  2. When studying two predictors with the same effort, our samples are dispersed in the 2D space. We can't achieve the same density, but we can still cover the entire 2D space in a regular grid.
  3. The more likely reality is a dispersed distribution over the 2D space with some areas covered well and others poorly: the data become sparse. Maintaining a high sampling density becomes increasingly difficult when more than two dimensions are involved. We no longer cover our predictor space well enough!


Presenting Results

Always:

  1. A complete description of the model, all variables, and the variable-selection method used
  2. A table of regression parameters, standard errors or confidence intervals, and p-values
  3. \(R^2\) or another metric of goodness of fit

Graphical options

  1. Scatterplot-like representations are not possible for >2 predictors.
  2. Diagnostics (QQ-plots, residuals vs. fitted values) are even more important!
  3. Partial response curves: effect of each variable, holding others constant

When \[ Y = b_0+b_1 \cdot X_1+b_2 \cdot X_2 \] and \[ Y_{adj} = Y-b_0-b_2 \cdot X_2, \] then \(Y_{adj}\) is adjusted for the effect of \(X_2\) (and the intercept \(b_0\)) and may be meaningfully plotted against \(X_1\) (when this variable is considered more important).
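A sketch of this adjustment for a model y ~ x1 + x2 (names are placeholders); for comparison, avPlots() in the car package draws related added-variable plots:

b = coef(mod)
y_adj = dat$y - b["(Intercept)"] - b["x2"] * dat$x2   # remove intercept and x2 effect
plot(dat$x1, y_adj)                                   # partial response of y to x1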


Last example

One in 100,000 people have a condition, sigmocogititis, that causes you to think too much about statistics.

We have a test with a type 1 error rate (\(\alpha\)) = 0.05, and a type 2 error rate (\(\beta\)) of 0.

I take the test, p < 0.05, and so I begin treating my condition.

Problem: The probability that I have the disease is only 0.0002!

If we test 100,000 people of unknown status, about 5% (≈5,000) will test positive, but only one of them will actually have the condition. We make the wrong decision in 4,999 out of 5,000 of these cases!
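The same number follows from Bayes' rule; a quick check in R, with the prior, \(\alpha\) and \(\beta\) taken from the example above:

prior = 1 / 100000   # P(condition)
alpha = 0.05         # P(positive | healthy), type 1 error rate
power = 1            # P(positive | condition), since beta = 0
power * prior / (power * prior + alpha * (1 - prior))   # P(condition | positive), ~0.0002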