Suspended particulate matter in lakes

Load the LakeSPM dataset (Lindström, M., Håkanson, L., Abrahamsson, O., Johansson, H. (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185–198).

lakes = read.table("data/LakeSPM.txt", header=TRUE)
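str(lakes) # quick look at variable names and types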

The study covers a set of lakes with SPM (suspended particulate matter) as the main response of interest. High SPM is a water quality issue; it can be generated by excessive phytoplankton growth, by resuspension of sediment in shallow lakes, or by terrigenous input into smaller lakes. For water quality prediction, a model that predicts SPM from easily accessible lake data is needed. The available predictors include, among others, TP, Dm, Dmax, area, pH, and Vd (see the column names of the dataset).

  1. Formulate a set of hypotheses about the effects of productivity and lake depth on spm. Choose variables wisely, then run a formal hypothesis test using MLR. A suggestion: H1: productive lakes support phytoplankton growth, thus higher spm. H2: shallow lakes allow resuspension, thus higher spm. Use the variables TP and Dm, and consider log-transformation (also of spm) to improve linearity; a sketch follows below.
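
A minimal sketch of such a test (assuming the column names spm, TP, and Dm; adapt if you choose a different transformation):

mlr.h = lm(log(spm) ~ log(TP) + log(Dm), data = lakes)
summary(mlr.h) # t-tests on log(TP) and log(Dm) address H1 and H2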

  2. These hypotheses are lame ;-) We have known for decades that nutrients influence productivity. For water quality forecasting, a model to predict spm is needed. Take an exploratory approach with the aim of designing a good model for spm. Explore the data for potential relationships with the response spm. Assess distributions and the chance of collinearity by graphical means (e.g. using pairs() or plot(data)). Consider appropriate transformations to improve the linearity of relationships.

  3. With vif() from the car package you can assess collinearity. Identify redundant variables using the VIF and drop them from the dataset.

# drop the first two columns, non-numeric
lakes = lakes[, -c(1, 2)]
plot(lakes) # skewness issues!

# log-transform everything except pH and Vd (columns 5 and 10)
lakes[, -c(5,10)] = log(lakes[, -c(5,10)])
plot(lakes)

# suggestion: make some histograms!
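# e.g. for the response (repeat for the predictors):
hist(lakes$spm)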

# dot (.) makes a model with all predictors 
mlr.full = lm(spm ~ ., data = lakes) 
summary(mlr.full)
library(car)
vif(mlr.full) #computes VIF

# look at how area is predicted by the X matrix
# R^2 is one!
summary(lm(area ~ . - spm, data = lakes))

# try a few reductions
# minus removes a predictor from the model
vif(lm(spm ~ . - area, data = lakes))
vif(lm(spm ~ . - area - Dm, data = lakes))
vif(lm(spm ~ . - area - Dm - Dmax, data = lakes))
  4. Build a full model using all the non-redundant predictors.
# drop predictors that we killed with VIF
lakes = lakes[, -c(2:4)]
mf = lm(spm ~ ., data = lakes)
summary(mf)
  5. Use step() to identify a good model based on the AIC. Be sure to check the diagnostics of the final model you get. Can you draw conclusions about significance?
mod = step(mf, direction="both")
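
Inspect the selected model and its residual diagnostics, for example:

summary(mod)
par(mfrow = c(2, 2))
plot(mod) # residuals vs fitted, Q-Q, scale-location, leverage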

Bonus

Try a leave-one-out cross-validation (LOOCV) approach to test the models' predictive ability. The procedure should look like this:

  1. Take your best/favourite model from above, and also a second model for comparison.
  2. Drop the first data point from the lakes data frame.
  3. Fit both models on the new dataset with n-1 rows.
  4. Use the models to predict \(y\) for the left-out case.
  5. Compute the squared error for the left-out case for both models: \(\mathrm{sqerr} = (y_{\mathrm{predicted}} - y_{\mathrm{measured}})^2\)
  6. Use a loop to repeat steps 2–5 \(n\) times, each time dropping a different data point. Save the error each time.
  7. Compute the root mean square error (RMSE) for each model: \(\mathrm{RMSE}=\sqrt{\mathrm{mean}(\mathrm{sqerr})}\)
  8. Plot the observed vs the predicted values for each left-out case and both models.

Some syntax to get you started

n = nrow(lakes)
# create empty vectors to store predicted cases
preds_1 = numeric(n)
preds_2 = numeric(n)
# loop over every data point
for(i in 1:n) {
    # drop data point i
    lakes_sm = lakes[-i, ]
    
    # fit both models on the reduced data
    m1 = ??
    m2 = ??

    # use the predict() function to get a model prediction
    # see ?predict.lm for clues
    preds_1[i] = predict(m1, ??)
    preds_2[i] = predict(m2, ??)
}
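
One possible completion is sketched below; it assumes the step()-selected model mod as model 1 and the full model with all remaining predictors as model 2 (substitute your own favourites):

n = nrow(lakes)
preds_1 = numeric(n)
preds_2 = numeric(n)
for (i in 1:n) {
    # drop data point i and refit both models on n-1 rows
    lakes_sm = lakes[-i, ]
    m1 = update(mod, data = lakes_sm)
    m2 = lm(spm ~ ., data = lakes_sm)
    # predict the left-out case
    preds_1[i] = predict(m1, newdata = lakes[i, ])
    preds_2[i] = predict(m2, newdata = lakes[i, ])
}
# RMSE for each model
sqrt(mean((preds_1 - lakes$spm)^2))
sqrt(mean((preds_2 - lakes$spm)^2))
# observed vs predicted with a 1:1 line
plot(lakes$spm, preds_1, xlab = "observed (log spm)", ylab = "predicted (log spm)")
points(lakes$spm, preds_2, pch = 2)
abline(0, 1)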