Below you will find two tasks, similar to exercises we have done in class. Your task is to produce a document answering the numbered questions below using the tools you’ve learned in unit 2. Please produce a Word/PDF document or similar with the answers organised by task and question. If figures or tables are requested, you can embed them in this document. Note that figures and tables should be near publication quality: proper use of colours, labelled axes, units included, etc.
Please do not embed code in this document. Instead, please save your code in a separate R file. You can mark the task and question number using comments in the code. Turn in the code along with the writeup.
You may work on this individually or in small groups (maximum of 3 students). Please turn this in to Gabriel (gabriel.singer@uibk.ac.at) by February 14, 2025 at the latest. If you are stuck, please get in touch by email to make an appointment well in advance.
We will use an unpublished dataset from Fred Omengo's master's thesis. You will meet the same dataset again when practicing multivariate methods.
Dataset description: 54 streams in the Mara River basin were investigated for water chemical conditions, with a focus on the quantity and quality of dissolved organic matter. In addition, CO2 concentrations in the water were computed from pH and alkalinity. The streams were selected to vary in the dominant landuse of their catchments and in stream size. The streams also differed in sampling time (arbitrary) and canopy cover (somewhat related to landuse). The water chemical variables include a range of anions and cations, specifically fractions of the most important nutrients N and P, as well as conductivity and turbidity. The quantity of carbon available for microbial respiration is measured as DOC. Then there is a range of optical descriptors of dissolved organic matter (DOM) quality, all based on absorbance or fluorescence measurements. We cannot go into detail on those DOM quality variables here; just understand them as informing about molecular size, humification, protein abundance, algal or soil origin, etc. For a slightly more detailed variable description, see the sheet metadata in the Excel file.
Here is some code to load the data (which is located in vu_datenanalyse_students/unit_2/data/):
data = read.table("data/MaraRiver_full.txt", header = TRUE)
names(data)
dim(data)
Start by investigating the effects of landuse, which is coded as a three-level factor: catchments of streams were classified into categories A (=agriculture), M (=mixed) and F (=forest) depending on dominant landuse. Test whether landuse affects canopy cover (removal of vegetation!), TSS (soil erosion!), and epCO2 (high values indicate high respiration or large CO2 inputs). Note that the factor landuse can be considered an “ordered factor”; plots should reflect this order.
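As a reminder of the general workflow (not a solution to the task), here is a minimal sketch of a one-way ANOVA on a factor with a defined level order. It runs on simulated data, so the data frame sim, the column canopy, its values and its unit are all assumptions, as is the direction of the ordering (forest to agriculture); check names(data) for the real variable names.

```r
# Simulated stand-in data; column names, values and units are hypothetical
set.seed(1)
sim <- data.frame(
  landuse = rep(c("A", "M", "F"), each = 18),
  canopy  = c(rnorm(18, 20, 8), rnorm(18, 45, 8), rnorm(18, 75, 8))
)

# Fix the level order along the landuse gradient so plots and tables follow it.
# (factor(..., ordered = TRUE) would additionally switch to polynomial contrasts.)
sim$landuse <- factor(sim$landuse, levels = c("F", "M", "A"))

# One-way ANOVA, followed by Tukey's pairwise comparisons
fit <- aov(canopy ~ landuse, data = sim)
summary(fit)
TukeyHSD(fit)

# Check assumptions: normality of residuals and homogeneity of variances
shapiro.test(residuals(fit))
bartlett.test(canopy ~ landuse, data = sim)

# Boxplot that respects the level order, with labelled axes
boxplot(canopy ~ landuse, data = sim,
        names = c("Forest", "Mixed", "Agriculture"),
        xlab = "Dominant landuse",
        ylab = "Canopy cover (simulated, %)")
```

If the assumption checks fail for a real variable, a transformation of the response or a non-parametric alternative such as kruskal.test() may be more appropriate.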
In a second step of the analysis, consider that stream size has a major influence on many variables and may have to be accounted for simultaneously in an analysis targeting landuse. The best proxy for stream size in this dataset is discharge (variable name: Q). Your aim should be to use ANCOVA.
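The ANCOVA logic can be sketched as follows, again on simulated data. The response logTSS, the log-transformation of Q and all effect sizes are assumptions made for illustration only; whether a transformation is needed for the real data is something to check yourself.

```r
# Simulated stand-in data; names other than Q and landuse are hypothetical
set.seed(2)
n <- 54
sim <- data.frame(
  landuse = factor(rep(c("F", "M", "A"), each = 18), levels = c("F", "M", "A")),
  Q       = exp(rnorm(n, mean = 0, sd = 1))   # skewed, like real discharge
)
# Simulated response: increases with log(Q) and shifts with landuse
shift <- unname(c(F = 0, M = 0.5, A = 1)[as.character(sim$landuse)])
sim$logTSS <- 1 + 0.6 * log(sim$Q) + shift + rnorm(n, sd = 0.3)

# Step 1: model with interaction, i.e. a separate slope per landuse class
full <- lm(logTSS ~ log(Q) * landuse, data = sim)

# Step 2: ANCOVA proper, i.e. a common slope with landuse-specific intercepts
ancova <- lm(logTSS ~ log(Q) + landuse, data = sim)

# A non-significant comparison supports the common-slope assumption
anova(ancova, full)

# Sequential sums of squares: landuse is tested after accounting for log(Q)
anova(ancova)
summary(ancova)
```

The order of terms matters for anova() on a single model because it uses sequential (type I) sums of squares; putting the covariate first tests landuse after stream size has been accounted for.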
Some streams are strongly supersaturated in CO2, others less so. Identify potential controls among the available variables and build a model that predicts epCO2. Some predictors are straightforward and should be chosen, e.g. the amount of metabolizable DOC. Others have the character of a covariate (e.g. the time difference from solar noon, indicating a lack of light for primary production at the time of sampling).
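A minimal sketch of building and checking a multiple linear regression is given below. The data are simulated, the predictor names noon_diff and canopy are hypothetical, and the log-transformations are assumptions; the choice of predictors for the real data is part of the task.

```r
# Simulated stand-in data; predictor names and effects are hypothetical
set.seed(3)
n <- 54
sim <- data.frame(
  DOC       = rlnorm(n, meanlog = 1, sdlog = 0.4),
  noon_diff = runif(n, 0, 5),     # hours from solar noon (assumed unit)
  canopy    = runif(n, 0, 100)    # no true effect in this simulation
)
sim$epCO2 <- exp(0.5 + 0.8 * log(sim$DOC) + 0.1 * sim$noon_diff +
                   rnorm(n, sd = 0.3))

# Inspect correlations among candidate predictors (collinearity check)
round(cor(sim[, c("DOC", "noon_diff", "canopy")]), 2)

# Full model, then a reduced model without the candidate to be dropped
full    <- lm(log(epCO2) ~ log(DOC) + noon_diff + canopy, data = sim)
reduced <- lm(log(epCO2) ~ log(DOC) + noon_diff, data = sim)

# Compare the nested models by F-test and by AIC
anova(reduced, full)
AIC(full, reduced)
summary(reduced)

# Standard diagnostic plots for the chosen model
par(mfrow = c(2, 2))
plot(reduced)

# Prediction (on the log scale) for a hypothetical new stream
pred <- predict(reduced,
                newdata = data.frame(DOC = 3, noon_diff = 2),
                interval = "prediction")
pred
```

Because the response is modelled as log(epCO2), the predicted values and interval limits need to be back-transformed with exp() to be read on the original scale.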
The dataset has quite a high number of variables. Choosing predictors to predict epCO2 was not an easy task. Consider that some predictors could be condensed using PCA and then expressed as potentially meaningful metavariables (i.e. PCA scores), e.g. “inorganic chemistry” or “DOM quality”. As PCA axes are orthogonal to each other (i.e. they are not correlated), they are well suited as input for MLR.
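The idea of condensing predictors into PCA scores and feeding them to an MLR can be sketched as follows. The variables cond, Cl, SO4 and NO3 and the response log_epCO2 are simulated, and their names are assumptions, not the actual column names of the dataset.

```r
# Simulated stand-in data: three variables share one latent gradient
set.seed(4)
n <- 54
latent <- rnorm(n)
chem <- data.frame(
  cond = latent + rnorm(n, sd = 0.3),
  Cl   = latent + rnorm(n, sd = 0.3),
  SO4  = latent + rnorm(n, sd = 0.3),
  NO3  = rnorm(n)                      # unrelated to the latent gradient
)
log_epCO2 <- 1 + 0.5 * latent + rnorm(n, sd = 0.3)

# PCA on centred and scaled variables (skewed real variables may need
# transformation first)
pca <- prcomp(chem, center = TRUE, scale. = TRUE)
summary(pca)              # proportion of variance explained per axis
round(pca$rotation, 2)    # loadings: which variables define each axis

# Extract the scores of the first two axes as metavariables
scores <- as.data.frame(pca$x[, 1:2])
scores$log_epCO2 <- log_epCO2

# Scores of different axes are uncorrelated by construction
round(cor(scores$PC1, scores$PC2), 10)

# Use the scores as predictors in a multiple linear regression
mlr <- lm(log_epCO2 ~ PC1 + PC2, data = scores)
summary(mlr)
```

Note that the sign of a PCA axis is arbitrary, so the loadings should be inspected before interpreting the direction of a regression coefficient on a score.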