Gabriel Singer
21.01.2025
Multiple metric continuous independent predictors are used to predict one metric response variable.
Model formulation: linear combination of \(p\) predictors: \[ \mathbb{E}(y) =\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p \]
Using matrix notation:
\[ \hat{y} = \mathbf{X}\mathbf{B} \]
\(\mathbf{B}\) is the parameter vector
\(\mathbf{X}\) is the design matrix
\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,p} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,p} \\ \end{bmatrix} \]
\[ \begin{aligned} \mathbb{E}(y) = \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]
The equation is a linear system and can be solved with linear algebra by OLS, minimizing the sum of squared residuals:
\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (y - \mathbf{X}\mathbf{B} \right)^2 \]
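The normal-equation solution \(\hat{\mathbf{B}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top y\) can be checked directly against `lm()`. A minimal sketch with synthetic data (variable names are illustrative, not from the course dataset):

```r
# OLS by hand: solve the normal equations and compare with lm()
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)               # design matrix with intercept column
B <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^(-1) X'y
B
coef(lm(y ~ x1 + x2))               # same estimates
```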
Do not compare p-values to determine which predictor is “better”
Rather, compare standardized effects
Either standardize all predictors (especially if \(s_{x_1} >> s_{x_2}\)) or standardize the coefficients.
\[ {b_j}^*=b_j\frac{s_{X_j}}{s_Y} \]
Higher \({b_j}^*\) means stronger influence of \(x_j\). Note that in software output \({b_j}^*\) is often referred to as \(\beta_j\).
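Both routes to standardized effects give the same numbers: rescaling the raw coefficients with \(b_j \, s_{X_j}/s_Y\), or refitting on `scale()`d variables. A sketch with synthetic data (names are illustrative), where \(x_1\) deliberately has a much larger spread:

```r
# Standardized coefficients two ways
set.seed(2)
x1 <- rnorm(100, sd = 10)   # large spread
x2 <- rnorm(100, sd = 1)    # small spread
y  <- 0.1 * x1 + 1 * x2 + rnorm(100)

mod   <- lm(y ~ x1 + x2)
bstar <- coef(mod)[-1] * c(sd(x1), sd(x2)) / sd(y)
bstar                                            # b_j * s_xj / s_y
coef(lm(scale(y) ~ scale(x1) + scale(x2)))[-1]   # same values
```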
To express uncertainty for a regression slope, use confidence intervals. For the t-statistic use \(d.f.=n-(p+1)\) where \(n\) is sample size and \(p\) is the number of involved predictors in the model.
BUT: adding predictors (even random numbers) will always increase \(R^2\) (by making the model more flexible)
One solution is Adjusted \(R^2\): penalises \(R^2\) for additional model complexity.
\[ {R^2}_{adj}=1-\frac{SS_{res}}{SS_{tot}}\times\frac{n-1}{n-p-1} \]
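Adjusted \(R^2\) rescales the residual fraction by the residual degrees of freedom \(n-p-1\); a quick hand computation can be checked against `summary()` (synthetic data, names are illustrative):

```r
# Adjusted R^2 by hand vs. summary()
set.seed(3)
x1 <- rnorm(40); x2 <- rnorm(40)
y  <- x1 + rnorm(40)
mod <- lm(y ~ x1 + x2)

n <- length(y); p <- 2
ss_res <- sum(residuals(mod)^2)
ss_tot <- sum((y - mean(y))^2)
1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)
summary(mod)$adj.r.squared   # same value
```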
May be useful in particular cases. Example: \[ Flux=k \cdot \Delta C \] \[ \log Flux=\log k + \log \Delta C \] Fitting this model to a dataset gives \(R^2=1\), with the explained variance partitioned into fractions for \(\log k\) and \(\log \Delta C\).
Check residual normality with `qqnorm(residuals(mod))` and `qqline(residuals(mod))`, and use `plot(mod)` for the standard diagnostic plots.
Run `cor` on the predictor matrix and check for large correlations. Why is this a problem?
Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:
\[ x_i = b_0 + b_1x_1 + \dots + b_kx_k + \epsilon \quad \mathrm{(excluding~}x_i) \]
Large \(R^2_i\) would argue for redundancy of \(x_i\) (its information is already contained in a linear combination of all other \(x\)-variables).
\(VIF_i\) is a transformation of \(R^2_i\):
\[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \] The VIF (variance inflation factor, hence the name) tells you by how much the SE of a regression coefficient increases due to inclusion of additional \(x\)-variables:
\[ s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}} \]
\(s_{b,\rho=0}\) is the standard error of a regression coefficient, assuming that all predictors are uncorrelated
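The definition can be verified by hand: regress one predictor on the others, plug its \(R^2_i\) into \(1/(1-R^2_i)\), and compare with `car::vif()`. A sketch with synthetic, deliberately correlated predictors (names are illustrative):

```r
# VIF by hand vs. car::vif() (requires the car package)
set.seed(4)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)   # strongly correlated with x1
x3 <- rnorm(100)

r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2_1)                    # VIF for x1

y <- x1 + x2 + x3 + rnorm(100)
car::vif(lm(y ~ x1 + x2 + x3))    # same value for x1
```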
# install.packages("car") # install the package, only need to do this once!
library(car)
## Loading required package: carData
library(palmerpenguins) # assuming the penguins data come from this package
full_mod = lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm +
    body_mass_g + sex, data = penguins)
vif(full_mod)
Procedure: if any VIF > 10, drop the variable with the largest VIF, repeat
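The procedure can be wrapped in a small loop; a sketch (the helper name is mine, and it assumes numeric predictors so that `car::vif()` returns a plain vector rather than the GVIF table produced for factors):

```r
# Repeatedly drop the predictor with the largest VIF until all VIF < 10
library(car)
drop_high_vif <- function(mod, threshold = 10) {
  repeat {
    v <- vif(mod)
    if (max(v) < threshold) return(mod)
    worst <- names(v)[which.max(v)]
    mod <- update(mod, paste(". ~ . -", worst))  # refit without that term
  }
}
# usage: reduced_mod <- drop_high_vif(full_mod)
```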
Simplest approach: fit a full model with all predictors, then drop anything that is non-significant - at once or stepwise ;-)
Alternatively:
The best model in a set is the one that lets you accurately predict the value of a new unobserved outcome \(y_*\) based on the vector of inputs/predictors \(\mathbf{X}_*\).
We want to minimize the expected prediction error
\[ \mathbb{E}(\epsilon_*) = \left| y_* - \mathbf{X}_*\mathbf{B} \right| \]
We usually cannot measure it directly; instead we approximate it with Akaike’s Information Criterion (AIC)
\[ AIC = n \cdot \ln SS_{err}+ 2(p+1)-n \cdot \ln n \] Like adjusted \(R^2\), it penalises models for complexity.
Lower AIC is better. The model with lowest AIC is called the most parsimonious model. Models that differ by <2 units in AIC are considered “equivalent” and usually reported as well.
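The slide formula differs from R's `AIC()` by additive constants (R's version also counts the residual variance as a parameter and includes the Gaussian normalizing constant), but since only AIC *differences* matter for model comparison, the two agree. A sketch with synthetic data:

```r
# AIC from the slide formula vs. R's AIC(): differences agree
set.seed(5)
x1 <- rnorm(60); x2 <- rnorm(60)
y  <- x1 + rnorm(60)
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)

aic_slide <- function(mod) {
  n <- length(residuals(mod)); p <- length(coef(mod)) - 1
  n * log(sum(residuals(mod)^2)) + 2 * (p + 1) - n * log(n)
}
aic_slide(m2) - aic_slide(m1)
AIC(m2) - AIC(m1)   # same difference: the constants cancel
```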
Question: What measurements can be used to predict bill length?
full_mod = lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm +
body_mass_g + sex, data = penguins)
summary(full_mod)
##
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm +
## body_mass_g + sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6273 -1.4293 -0.0548 1.1696 7.2860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6106378 8.6416123 0.071 0.94379
## bill_depth_mm 0.6735216 0.3385003 1.990 0.04902 *
## flipper_length_mm 0.1331484 0.0466317 2.855 0.00511 **
## body_mass_g 0.0015048 0.0007275 2.069 0.04085 *
## sexmale 0.5253570 0.7310479 0.719 0.47384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.136 on 114 degrees of freedom
## Multiple R-squared: 0.543, Adjusted R-squared: 0.5269
## F-statistic: 33.86 on 4 and 114 DF, p-value: < 2.2e-16
Question: What measurements can be used to predict bill length?
Does sex matter? Or can a sex effect be more simply explained by size?
`step` in R will build a model for you automatically, adding or dropping terms in an attempt to find a model that minimises AIC. Such models are prone to overfitting: irrelevant predictors that correlate with the outcome by chance result in a model that fits the dataset well, but performs poorly when challenged with new data (low transferability).
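Typical `step` usage, starting from the full penguin model (assuming the `penguins` data are loaded, e.g. via the palmerpenguins package; `direction = "both"` allows both adding and dropping terms):

```r
# Automatic AIC-based model search with step()
full_mod <- lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm +
               body_mass_g + sex, data = penguins)
best_mod <- step(full_mod, direction = "both", trace = FALSE)
formula(best_mod)   # the retained terms
```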
mod1 = lm(brain_volume_cc ~ body_mass_kg, data = hom)
mod2 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2), data = hom)
mod3 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3), data = hom)
mod4 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4), data = hom)
mod5 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4) + I(body_mass_kg^5), data = hom)
mod6 = lm(brain_volume_cc ~ body_mass_kg + I(body_mass_kg^2) + I(body_mass_kg^3) + I(body_mass_kg^4) + I(body_mass_kg^5) + I(body_mass_kg^6), data = hom)
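The six polynomial fits above can be compared in one call (assuming the `hom` data frame from the slides is loaded); by the AIC logic, in-sample fit will keep improving with degree while the penalty guards against overfitting:

```r
# Compare the six polynomial models: fit always improves, AIC need not
AIC(mod1, mod2, mod3, mod4, mod5, mod6)
sapply(list(mod1, mod2, mod3, mod4, mod5, mod6),
       function(m) summary(m)$adj.r.squared)
```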
Assume predictors \(X_1\) and \(X_2\) and limited sampling effort of n=16:
When \[ Y = b_0+b_1 \cdot X_1+b_2 \cdot X_2 \] \[ Y_{adj} = Y-b_0-b_2 \cdot X_2 \] then \(Y_{adj}\) is adjusted for the effect of \(X_2\) (and the intercept \(b_0\)) and may be meaningfully plotted against \(X_1\) (when this variable is considered more important).
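A sketch of such an adjusted plot (the data frame `d` and its columns `y`, `x1`, `x2` are hypothetical placeholders):

```r
# Plot Y adjusted for X2 against X1
mod   <- lm(y ~ x1 + x2, data = d)
b     <- coef(mod)
y_adj <- d$y - b["(Intercept)"] - b["x2"] * d$x2
plot(d$x1, y_adj, xlab = "X1", ylab = "Y adjusted for X2")
abline(0, b["x1"])   # partial slope of X1, through the origin
```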
One in 100,000 people has a condition, sigmocogititis, that causes you to think too much about statistics.
We have a test with a type 1 error rate (\(\alpha\)) = 0.05, and a type 2 error rate (\(\beta\)) of 0.
I take the test, p < 0.05, and so I begin treating my condition.
Problem: The probability that I have the disease is only 0.0002!
If we test 100,000 people with unknown status, 5% (5000) will test positive, but only one will have the condition. We made the wrong decision in 4999/5000 of these cases!
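The 0.0002 figure is just Bayes' rule applied to the numbers above:

```r
# P(disease | positive test) via Bayes' rule
prev  <- 1e-5   # 1 in 100,000 have the condition
alpha <- 0.05   # false-positive rate (type 1 error)
power <- 1      # beta = 0, so every true case tests positive

p_pos <- prev * power + (1 - prev) * alpha   # P(positive)
prev * power / p_pos                         # ~0.0002
```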