Linear Regression

Gabriel Singer

21.01.2025

The linear model

\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \\ \\ \end{aligned} \]

\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:

\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
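As a minimal illustration, we can simulate data from this model in R; the intercept, slope, and residual standard deviation below are arbitrary choices:

## simulate data from y = a + b*x + eps, with eps ~ N(0, s_eps)
set.seed(1)                                   ## for reproducibility
n = 50
a_true = 1; b_true = 0.5; s_eps_true = 0.8    ## arbitrary "true" parameters
x = runif(n, 0, 10)
y = a_true + b_true * x + rnorm(n, mean = 0, sd = s_eps_true)
plot(x, y, pch = 16)                          ## x predicts y only imperfectly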

Method of least squares

\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

We want to solve these equations for \(a\) and \(b\).

The “best” \(a\) and \(b\) are the ones that place the fitted line as close to the points as possible overall (i.e., that minimize \(s_{\epsilon}\)).

Method of least squares

\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]

Minimizing \(s_{\epsilon}\) is equivalent to minimizing the error (residual) sum of squares (ESS):

\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
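Continuing with the simulated data from the sketch above, the ESS and residual standard error can be computed by hand and compared with what lm() reports:

mod_sim = lm(y ~ x)                    ## least-squares fit to the simulated data
ESS = sum(residuals(mod_sim)^2)        ## error (residual) sum of squares
s_eps_hat = sqrt(ESS / (n - 2))        ## residual standard error
c(ESS = ESS, s_eps = s_eps_hat)
sigma(mod_sim)                         ## R's built-in equivalent of s_eps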

Ordinary least squares estimation

Solving for the minimum ESS yields estimates for the regression parameters:

\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
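These formulas are easy to verify in R with the simulated data and model (`mod_sim`) from the sketches above:

b_hat = cov(x, y) / var(x)             ## b = cov_xy / s_x^2
a_hat = mean(y) - b_hat * mean(x)      ## a = ybar - b * xbar
cor(x, y) * sd(y) / sd(x)              ## equivalently, b = r_xy * s_y / s_x
c(a = a_hat, b = b_hat)
coef(mod_sim)                          ## matches the lm() estimates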

The estimates \(a\) and \(b\) have standard errors:

\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x-\bar{x})^2}}} \]

\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum{(x-\bar{x})^2}}} \]

We can compute confidence intervals for \(a\) and \(b\) as:

\[ CI: a\pm t_{\frac{\alpha}{2},d.f.} s_a \]

Here we use \(d.f.=n-2\), where 2 is the number of parameters estimated by the model. The analogous formula holds for \(b\). These confidence intervals can be checked for overlap with 0 and thereby used to declare a regression parameter “significant”. Confidence intervals for a particular parameter may also be computed for subsets of the data; checking whether those CIs overlap is then a quick way to do a “homogeneity of slopes” comparison, for instance.
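Continuing the sketch with the simulated model `mod_sim`, the standard error and 95% confidence interval of the slope can be computed by hand and compared with confint():

s_b = sigma(mod_sim) / sqrt(sum((x - mean(x))^2))   ## standard error of b
t_crit = qt(0.975, df = n - 2)                      ## t quantile for alpha = 0.05
coef(mod_sim)[2] + c(-1, 1) * t_crit * s_b          ## 95% CI by hand
confint(mod_sim)["x", ]                             ## should match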

Significance tests

\(\mathbf{H_0}\): the slope \(\beta = 0\) (i.e., no variation in \(y\) is explained by variation in \(x\))

Under \(H_0\), the ratio of explained variance to residual variance follows an \(F\) distribution with \(d.f._1=1\) and \(d.f._2=n-2\):

\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]
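The F ratio can likewise be reproduced by hand from the simulated model `mod_sim` and compared with R's anova() table:

MS_model = sum((fitted(mod_sim) - mean(y))^2)        ## explained MS, df1 = 1
MS_err = sum(residuals(mod_sim)^2) / (n - 2)         ## residual MS, df2 = n - 2
MS_model / MS_err                                    ## F statistic
anova(mod_sim)                                       ## same F and p-value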

Goodness of fit

The coefficient of determination tells you the proportion of variance explained by the model:

\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
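The same check for \(r^2\), computed by hand from the simulated model `mod_sim` and compared with the value reported by summary():

r2 = sum((fitted(mod_sim) - mean(y))^2) / sum((y - mean(y))^2)
r2
summary(mod_sim)$r.squared             ## should match
cor(x, y)^2                            ## identical for simple linear regression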

Regression assumptions

The usual assumptions of the linear model are: the relationship between \(x\) and \(y\) is linear, the residuals are independent, the residuals are normally distributed, and the residual variance is constant across the range of \(x\) (homoscedasticity). When these assumptions are violated, transforming the variables can often help.

Transformations

Some easily linearizable functions using logarithms

The logarithmic function is given by:

\[ y=a+b\cdot \log{x} \]

It is already linear if \(\log{x}\) is used as the predictor.

The exponential function can be linearized by log-transformation:

\[ y=a\cdot e^{b\cdot x} \]

\[ \ln{y}=\ln{a}+bx \]

An exponential function with an arbitrary base \(a\) can be rewritten in terms of \(e\) and then linearized in the same way:

\[ y=c\cdot a^{b\cdot x}=ce^{b\cdot \ln{a}\cdot x} \]

\[ \ln{y}=\ln{c}+b\cdot \ln{a}\cdot x \]
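A minimal sketch of fitting an exponential function by log-transformation; the parameter values are arbitrary, and the noise is assumed multiplicative so that it becomes additive on the log scale:

## exponential: y = a * exp(b*x); linear after log-transforming y
set.seed(2)
x_e = seq(1, 10, length.out = 30)
y_e = 2 * exp(0.3 * x_e) * exp(rnorm(30, sd = 0.1))   ## a = 2, b = 0.3
mod_e = lm(log(y_e) ~ x_e)
exp(coef(mod_e)[1])    ## back-transformed intercept, estimate of a
coef(mod_e)[2]         ## estimate of b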

The power function can be linearized by log-transformation:

\[ y=a\cdot x^{b} \]

\[ \ln{y}=\ln{a}+b\cdot \ln x \]

Note for plotting: to display logarithmic, exponential, or power functions as straight lines, either plot the log-transformed variables on linearly scaled axes OR plot the original variables on log-scaled axes. Take care to label the axes accordingly.
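Analogously for the power function, a log-log regression (again with arbitrary simulated parameters):

## power: y = a * x^b; linear after log-transforming both variables
set.seed(3)
x_p = seq(1, 10, length.out = 30)
y_p = 1.5 * x_p^0.7 * exp(rnorm(30, sd = 0.1))        ## a = 1.5, b = 0.7
mod_p = lm(log(y_p) ~ log(x_p))
exp(coef(mod_p)[1])           ## estimate of a
coef(mod_p)[2]                ## estimate of b
plot(x_p, y_p, log = "xy")    ## original variables on log-scaled axes give a line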

Regression in R

## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17), 
        wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 
                        3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
## 
## Call:
## lm(formula = wing_length ~ age, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3070 -0.2154  0.0655  0.1632  0.2251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7131     0.1479    4.82  0.00053 ***
## age           0.2702     0.0135   20.03  5.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.971 
## F-statistic:  401 on 1 and 11 DF,  p-value: 5.27e-10
confint(mod)
##             2.5 % 97.5 %
## (Intercept) 0.388   1.04
## age         0.241   0.30

Regression in R: diagnostics

par(mfrow=c(1, 2), bty='n')

## scale() centers the residuals and divides by their standard deviation;
## for leverage-adjusted standardized residuals use rstandard(mod)
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)


## also try
## plot(mod)

Regression in R: diagnostics

## also try
par(mfrow = c(2,2), bty='n', mar = c(3, 3, 2,0), mgp = c(2, 0.5, 0))
plot(mod)

Presenting regression output

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1).

Table 1. Parameter estimates for regression of wing length on age

            Estimate   St. error   95% CI
Intercept   0.71       0.15        [0.39, 1.04]
Age         0.27       0.013       [0.24, 0.30]
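The values in Table 1 can be assembled directly from the fitted model object, for example:

## coefficient estimates, standard errors, and 95% CIs in one table
round(cbind(coef(summary(mod))[, 1:2], confint(mod)), 2)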