Load the LakeSPM dataset (Lindström, M., Håkanson, L., Abrahamsson, O., Johansson, H. (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185–198).
lakes = read.table("data/LakeSPM.txt", header=TRUE)
The study has a set of lakes with SPM=solid particulate matter as the main response of interest. High SPM is a water quality issue, it can be generated through excessive phytoplankton growth or resuspension of sediment in shallow lakes or terrigeneous input into smaller lakes. For water quality prediction a model to predict SPM from easily accessible lake data should be generated. The available predictors include
Formulate a set of hypotheses testing effects of productivity and lake depth on spm. Choose variables wisely, then run a formal hypothesis test using MLR. A suggestion is: H1: Productive lakes support phytoplankton growth, thus higher spm. H2: Shallow lakes allow resuspension, thus higher spm. Use variables TP and Dm, consider log-transformation (also of spm) to improve linearity.
These hypotheses are lame ;-) We know that nutrients influence
productivity since decades. For water quality forecasting a model to
predict spm is needed. Start an explorative approach with the aim to
design a good model for spm. Explore the data for potential
relationships with the response SPM. Assess distributions and chance for
collinearity by graphical means (e.g. using pairs
or
plot(data)
). Consider appropriate transformation to improve
linearity of relationships.
With vif
from the car
-package you can
assess collinearity. Identify redundant variables using VIF and drop
them from the dataset.
# drop the first two columns, non-numeric
lakes = lakes[, -c(1, 2)]
plot(lakes) # skewness issues!
# log transform everything except pH and Vd
lakes[, -c(5,10)] = log(lakes[, -c(5,10)])
plot(lakes)
# suggestion: make some histograms!
# dot (.) makes a model with all predictors
mlr.full = lm(spm ~ ., data = lakes)
summary(mlr.full)
library(car)
vif(mlr.full) #computes VIF
# look at how area is predicted by the X matrix
# R^2 is one!
summary(lm(area ~ . - spm, data = lakes))
# try a few reductions
# minus removes a predictor from the model
vif(lm(spm ~ . - area, data = lakes))
vif(lm(spm ~ . - area - Dm, data = lakes))
vif(lm(spm ~ . - area - Dm - Dmax, data = lakes))
# drop predictors that we killed with VIF
lakes = lakes[, -c(2:4)]
mf = lm(spm ~ ., data = lakes)
summary(mf)
step
to identify a good model based on the
AIC. Be sure to check diagnostics of the final model you get. Can you
make conclusions about significance?mod = step(mf, direction="both")
Try a leave-one-out cross validation approach to test the model’s predictive ability. The procedure should look like this:
Some syntax to get you started
n = nrow(lakes)
# create empty vectors to store predicted cases
preds_1 = numeric(n)
preds_2 = numeric(n)
# loop over every data point
for(i in 1:n) {
# drop data point i
lakes_sm = lakes[-i, ]
# fit the models
m1 = ??
# use the predict() function to get a model prediction
# see ?predict.lm for clues
preds_1[i] = predict(m1, ??)
}