Linear Regression

Gabriel Singer

21.01.2025

The linear model

\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \\ \\ \end{aligned} \]

\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:

\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
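As a minimal illustration, we can simulate data from this model in R; the intercept, slope, and residual standard deviation below are arbitrary choices:

## simulate data from y = a + b*x + eps, with eps ~ N(0, s_eps)
set.seed(1)                                   ## for reproducibility
n = 50
a_true = 1; b_true = 0.5; s_eps_true = 0.8    ## arbitrary "true" parameters
x = runif(n, 0, 10)
y = a_true + b_true * x + rnorm(n, mean = 0, sd = s_eps_true)
plot(x, y, pch = 16)                          ## x predicts y only imperfectly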

Method of least squares

\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

We want to solve these equations for \(a\) and \(b\).

The “best” \(a\) and \(b\) are the ones that place the fitted line as close to the points as possible overall (i.e., that minimize \(s_{\epsilon}\)).

Method of least squares

\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]

Minimizing \(s_{\epsilon}\) is equivalent to minimizing the error (residual) sum of squares (ESS):

\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
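Continuing with the simulated data from the sketch above, the ESS and residual standard error can be computed by hand and compared with what lm() reports:

mod_sim = lm(y ~ x)                    ## least-squares fit to the simulated data
ESS = sum(residuals(mod_sim)^2)        ## error (residual) sum of squares
s_eps_hat = sqrt(ESS / (n - 2))        ## residual standard error
c(ESS = ESS, s_eps = s_eps_hat)
sigma(mod_sim)                         ## R's built-in equivalent of s_eps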

Ordinary least squares estimation

Solving for the minimum ESS yields estimates for the regression parameters:

\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
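These formulas are easy to verify in R with the simulated data and model (`mod_sim`) from the sketches above:

b_hat = cov(x, y) / var(x)             ## b = cov_xy / s_x^2
a_hat = mean(y) - b_hat * mean(x)      ## a = ybar - b * xbar
cor(x, y) * sd(y) / sd(x)              ## equivalently, b = r_xy * s_y / s_x
c(a = a_hat, b = b_hat)
coef(mod_sim)                          ## matches the lm() estimates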

The estimates \(a\) and \(b\) have standard errors:

\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x-\bar{x})^2}}} \]

\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum{(x-\bar{x})^2}}} \]

We can compute confidence intervals for \(a\) and \(b\) as:

\[ CI: a\pm t_{\frac{\alpha}{2},d.f.} s_a \]

Here we use \(d.f.=n-2\), where 2 is the number of parameters estimated by the model. The analogous formula holds for \(b\). These confidence intervals can be checked for overlap with 0 and thereby used to declare a regression parameter “significant”. Confidence intervals for a particular parameter may also be computed for subsets of the data; checking whether those CIs overlap is then a quick way to do a “homogeneity of slopes” comparison, for instance.
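Continuing the sketch with the simulated model `mod_sim`, the standard error and 95% confidence interval of the slope can be computed by hand and compared with confint():

s_b = sigma(mod_sim) / sqrt(sum((x - mean(x))^2))   ## standard error of b
t_crit = qt(0.975, df = n - 2)                      ## t quantile for alpha = 0.05
coef(mod_sim)[2] + c(-1, 1) * t_crit * s_b          ## 95% CI by hand
confint(mod_sim)["x", ]                             ## should match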

Significance tests

\(\mathbf{H_0}\): the slope \(\beta = 0\) (i.e., no variation in \(y\) is explained by variation in \(x\))

Under \(H_0\), the ratio of explained variance to residual variance follows an \(F\) distribution with \(d.f._1=1\) and \(d.f._2=n-2\):

\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]
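The F ratio can likewise be reproduced by hand from the simulated model `mod_sim` and compared with R's anova() table:

MS_model = sum((fitted(mod_sim) - mean(y))^2)        ## explained MS, df1 = 1
MS_err = sum(residuals(mod_sim)^2) / (n - 2)         ## residual MS, df2 = n - 2
MS_model / MS_err                                    ## F statistic
anova(mod_sim)                                       ## same F and p-value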

Goodness of fit

The coefficient of determination tells you the proportion of variance explained by the model:

\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
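The same check for \(r^2\), computed by hand from the simulated model `mod_sim` and compared with the value reported by summary():

r2 = sum((fitted(mod_sim) - mean(y))^2) / sum((y - mean(y))^2)
r2
summary(mod_sim)$r.squared             ## should match
cor(x, y)^2                            ## identical for simple linear regression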

Regression assumptions

The usual assumptions of the linear model are: the relationship between \(x\) and \(y\) is linear, the residuals are independent, the residuals are normally distributed, and the residual variance is constant across the range of \(x\) (homoscedasticity). When these assumptions are violated, transforming the variables can often help.

Transformations

Some easily linearizable functions using logarithms

The logarithmic function is given by:

\[ y=a+b\cdot \log{x} \]

It is already linear if \(\log{x}\) is used as the predictor.

The exponential function can be linearized by log-transformation:

\[ y=a\cdot e^{b\cdot x} \]

\[ \ln{y}=\ln{a}+bx \]

An exponential function with an arbitrary base \(a\) can be rewritten in terms of \(e\) and then linearized in the same way:

\[ y=c\cdot a^{b\cdot x}=ce^{b\cdot \ln{a}\cdot x} \]

\[ \ln{y}=\ln{c}+b\cdot \ln{a}\cdot x \]
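A minimal sketch of fitting an exponential function by log-transformation; the parameter values are arbitrary, and the noise is assumed multiplicative so that it becomes additive on the log scale:

## exponential: y = a * exp(b*x); linear after log-transforming y
set.seed(2)
x_e = seq(1, 10, length.out = 30)
y_e = 2 * exp(0.3 * x_e) * exp(rnorm(30, sd = 0.1))   ## a = 2, b = 0.3
mod_e = lm(log(y_e) ~ x_e)
exp(coef(mod_e)[1])    ## back-transformed intercept, estimate of a
coef(mod_e)[2]         ## estimate of b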

The power function can be linearized by log-transformation:

\[ y=a\cdot x^{b} \]

\[ \ln{y}=\ln{a}+b\cdot \ln x \]

Note for plotting: to display logarithmic, exponential, or power functions as straight lines, either plot the log-transformed variables on linearly scaled axes OR plot the original variables on log-scaled axes. Take care to label the axes accordingly.
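Analogously for the power function, a log-log regression (again with arbitrary simulated parameters):

## power: y = a * x^b; linear after log-transforming both variables
set.seed(3)
x_p = seq(1, 10, length.out = 30)
y_p = 1.5 * x_p^0.7 * exp(rnorm(30, sd = 0.1))        ## a = 1.5, b = 0.7
mod_p = lm(log(y_p) ~ log(x_p))
exp(coef(mod_p)[1])           ## estimate of a
coef(mod_p)[2]                ## estimate of b
plot(x_p, y_p, log = "xy")    ## original variables on log-scaled axes give a line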

Regression in R

## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17), 
        wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 
                        3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
## 
## Call:
## lm(formula = wing_length ~ age, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3070 -0.2154  0.0655  0.1632  0.2251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7131     0.1479    4.82  0.00053 ***
## age           0.2702     0.0135   20.03  5.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.971 
## F-statistic:  401 on 1 and 11 DF,  p-value: 5.27e-10
confint(mod)
##             2.5 % 97.5 %
## (Intercept) 0.388   1.04
## age         0.241   0.30

Regression in R: diagnostics

par(mfrow=c(1, 2), bty='n')

## scale() centers the residuals and divides by their standard deviation;
## for leverage-adjusted standardized residuals use rstandard(mod)
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)


## also try
## plot(mod)

Regression in R: diagnostics

## also try
par(mfrow = c(2,2), bty='n', mar = c(3, 3, 2,0), mgp = c(2, 0.5, 0))
plot(mod)

Presenting regression output

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1).

Table 1. Parameter estimates for regression of wing length on age

            Estimate   St. error   95% CI
Intercept   0.71       0.15        [0.39, 1.04]
Age         0.27       0.013       [0.24, 0.30]
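The values in Table 1 can be assembled directly from the fitted model object, for example:

## coefficient estimates, standard errors, and 95% CIs in one table
round(cbind(coef(summary(mod))[, 1:2], confint(mod)), 2)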