Gabriel Singer
21.01.2025
Correlation assumes nothing about the causal nature of the relationship between \(x\) and \(y\)
Often, we have a hypothesis that a variable is in some way functionally dependent on another
A simple linear regression describes/tests the relationship between a response \(y\) and a predictor \(x\):
\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \end{aligned} \]
\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:
\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
We want to solve these equations for \(a\) and \(b\)
The “best” \(a\) and \(b\) are the ones that draw the line with the smallest overall squared vertical distance to the points (i.e., that minimizes \(s_{\epsilon}\)):
\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]
Equivalently, the fitted line minimizes the error sum of squares (ESS):
\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
Solving for the minimum ESS yields estimates for the regression parameters:
\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
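These closed-form estimates are easy to verify in R; a minimal sketch with simulated data (the seed, data, and variable names below are purely illustrative):

set.seed(42)
x = 1:20
y = 2 + 0.5*x + rnorm(20, sd = 1)   ## simulated data with known a = 2, b = 0.5
b = cov(x, y)/var(x)                ## slope: cov_xy / s_x^2
a = mean(y) - b*mean(x)             ## intercept: ybar - b*xbar
c(a = a, b = b)
coef(lm(y ~ x))                     ## lm() returns the same estimates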
These statistics have standard errors (here \(s_{\hat{y}}\) denotes the residual standard error \(s_{\epsilon}\)):
\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x-\bar{x})^2}}} \]
\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum{(x-\bar{x})^2}}} \] We can compute confidence intervals for \(a\) and \(b\) as:
\[ CI: a\pm t_{\frac{\alpha}{2},d.f.} s_a \] Here we use \(d.f.=n-2\), where 2 is the number of parameters estimated by the model. The analogous formula applies for \(b\). These confidence intervals can be checked for overlap with 0 to declare a regression parameter “significant” or not. Confidence intervals for a particular parameter may also be computed for subsets of the data; checking the CIs for overlap is then a quick way to perform, for instance, a “homogeneity of slopes” comparison.
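Continuing the simulated example above, a minimal sketch computing the slope’s standard error and 95% CI by hand and comparing them with confint():

n = length(x)
yhat = a + b*x
s_e = sqrt(sum((y - yhat)^2)/(n - 2))        ## residual standard error s_epsilon
s_b = s_e/sqrt(sum((x - mean(x))^2))         ## standard error of the slope
b + c(-1, 1)*qt(1 - 0.05/2, df = n - 2)*s_b  ## 95% CI for b
confint(lm(y ~ x))["x", ]                    ## matches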
\(\mathbf{H_0}\): The slope \(\beta = 0\) (i.e., no variation in \(y\) is explained by variation in \(x\))
The ratio of variance explained to residual variance follows an \(F\) distribution with \(d.f._1=1\) and \(d.f._2=n-2\).
\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]
The coefficient of determination tells you the proportion of variance explained by the model:
\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
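Still continuing the same simulated example, a minimal sketch of the \(F\) statistic, its p-value, and \(r^2\) computed directly from the definitions above:

MS_model = sum((yhat - mean(y))^2)                   ## d.f._1 = 1
MS_err = sum((y - yhat)^2)/(n - 2)                   ## d.f._2 = n - 2
Fstat = MS_model/MS_err
pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)  ## p-value of the F test
sum((yhat - mean(y))^2)/sum((y - mean(y))^2)         ## r^2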
The logarithmic function is given by: \[
y=a+b\cdot \log{x}
\]
The exponential function can be linearized by log-transformation:
\[ y=a\cdot e^{b\cdot x} \]
\[ \ln{y}=\ln{a}+bx \] Any exponential function with an arbitrary base can be rewritten with base \(e\) and then linearized:
\[ y=c\cdot a^{b\cdot x}=ce^{b\cdot \ln{a}\cdot x} \]
\[ \ln{y}=\ln{c}+b\cdot \ln{a}\cdot x \]
The power function can be linearized by log-transformation:
\[ y=a\cdot x^{b} \]
\[ \ln{y}=\ln{a}+b\cdot \ln x \] Note for plotting: if you want logarithmic, exponential, or power functions to appear as straight lines, either plot the log-transformed variables on linearly scaled axes OR plot the original variables on log-scaled axes. Take care to label the axes accordingly.
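As an illustration of the power-function case, a minimal sketch with simulated data (all names and true parameter values are made up for this example):

set.seed(1)
xp = seq(1, 10, length.out = 30)
yp = 2*xp^1.5*exp(rnorm(30, sd = 0.1))            ## power function with multiplicative error
fit = lm(log(yp) ~ log(xp))                       ## ln y = ln a + b * ln x
exp(coef(fit)[1])                                 ## back-transformed estimate of a (true value 2)
coef(fit)[2]                                      ## estimate of b (true value 1.5)
plot(xp, yp, log = "xy", xlab = "x", ylab = "y")  ## original variables on log-scaled axes
lines(xp, exp(fitted(fit)))                       ## fitted power function plots as a straight line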
## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17),
wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2,
3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
##
## Call:
## lm(formula = wing_length ~ age, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3070 -0.2154 0.0655 0.1632 0.2251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7131 0.1479 4.82 0.00053 ***
## age 0.2702 0.0135 20.03 5.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.971
## F-statistic: 401 on 1 and 11 DF, p-value: 5.27e-10
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 0.388 1.04
## age 0.241 0.30
par(mfrow=c(1, 2), bty='n')
## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)
We found a significant positive relationship between wing length and
age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1).
Table 1: Parameter estimates for the regression of wing length on age.

|           | Estimate | St. error | 95% CI       |
|-----------|----------|-----------|--------------|
| Intercept | 0.71     | 0.15      | [0.39, 1.03] |
| Age       | 0.27     | 0.013     | [0.24, 0.30] |