Boidiversity Modelling

library(sf)
library(terra)
library(ggplot2)
library(rnaturalearth)
library(scico)
library(gridExtra)

Today we will be exploring biodiversity using the IUCN rangemaps that we briefly explored in the first exercise. Note that this dataset includes all species for which the IUCN performs an assessment, and that, being expert-derived rangemaps, the resolution is rather coarse. Thus, our interpretations must be made with care, as they will necessarily represent approximate or potential diversity, rather than diversity as measured during (e.g.) a biodiversity inventory.

First let’s load and visualise the response variable. The mammal range maps, cropped to the same extent we have had before, are in the mammal_ranges.rds file. We can use the fasterize function to convert these individual range polygons into a map of species richness (by counting the number of polygons overlapping each grid cell). We use the climate data as the template for our raster.

mamm = readRDS("../data/biodiv/mammal_ranges.rds")
clim_pres = rast("../data/bioclim_pres.grd")
mamm_r = rasterize(mamm, clim_pres[[1]], fun = "sum", background = NA)

# fix a topological error with the countries
countries = st_make_valid(ne_countries(scale="medium", returnclass = "sf"))
# then crop countries to the data range
countries = st_crop(countries, mamm_r)

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

mamm_pl = as.data.frame(mamm_r, xy=TRUE)
c_map = ggplot(countries) + geom_sf(colour='black', fill='white') + xlab("") + ylab("")
c_map + geom_tile(data = mamm_pl, aes(x=x, y=y, fill=layer), alpha = 0.8) + 
    scale_fill_scico(palette="davos") + labs(fill="Mammal Species Richness")

Some patterns in species richness should be immediately apparent, especially the gradient from lowlands to mountains.

1. Concept

Here I will propose a few modelling questions that we can pursue given the data we have with us for this exercise. However, other models are also possible; consider other hypotheses and pursue them following your own interests.

Some starter questions:

What is the shape of the relationship between diversity and elevation?
Are any climatic patterns evident after controlling for elevation?
How does human influence affect diversity?
Are there any notable spatial (e.g., latitudinal) gradients in richness after accounting for other factors?
Do any of the above patterns change if we consider only species that are listed as threatened or endangered?

We have several distributions we could try for modelling species richness. Two that are commonly used are the Poisson and the lognormal.

Poisson models are useful for count data, especially with small counts (lower than 50). About 50% of our raster cells have fewer than 50 species, so this model is sensible. However, it comes with some assumptions. In particular, we must assume that species richness is conditionally Poisson distributed; this means that, if we hold all of the predictors in our model constant, we get Poisson distributed species richness (meaning that the variance in species richness is equal to the mean). We can check the marginal distribution to start:

sr = mamm_pl[,3]
xx = 0:max(sr)
yy = dpois(xx, mean(sr))
hist(sr, main = "", xlab = "Species Richness", freq=FALSE, ylim=c(0, max(yy)))
lines(density(sr, bw=5), lwd=2, col='red')
lines(xx, yy, lwd=2, col='blue')
legend("topleft", legend=c("empirical", "theoretical", 
                           paste("mean =", round(mean(sr), 2)), paste("var =", round(var(sr), 2))),
       lwd=c(2,2,0,0), col=c('red', 'blue', 'white', 'white'))

So marginal species richness is not poisson distributed; the variance is too high and there are too many low-richness sites. Perhaps this will get better with other predictors, but we can at least check the lognormal as well to see if it fits best before adding predictors.

yy = dlnorm(xx, mean(log(sr)), sd(log(sr)))
hist(sr, main = "", xlab = "Species Richness", freq=FALSE)
lines(density(sr, bw=5), lwd=2, col='red')
lines(xx, yy, lwd=2, col='blue')
legend("topleft", legend=c("empirical", "theoretical", 
                           paste("mean =", round(mean(sr), 2)), paste("var =", round(var(sr), 2))),
       lwd=c(2,2,0,0), col=c('red', 'blue', 'white', 'white'))

The lognormal appears to be an even worse fit, with too few high-diversity areas and too many low-diversity sites. So we will stick with the Poisson, keeping in mind we should be on the lookout for overdispersion (excess variance in the residuals), indicating that our model assumptions might not be met.

2. Data preparation

Our main task here is to load in all of the data sources we would like to use, get them into a unified format (same projection, extent and resolution), and then transform them into a data.frame for further analysis.

2.1 Data sources

In addition to the mammal range maps and the climate data from previous exercises, we now have a digital elevation model from the Copernicus land monitoring service and the human influence index.

If you are interested in more “raw” land cover data, you could also check out the Corine Land Cover dataset. Note that the Corine data is quite large (100-m resolution for all Europe), and can be a bit difficult for slower computers.

elev = rast("../data/biodiv/elevation.grd")
hii = rast("../data/biodiv/hii.grd")

elev_pl = as.data.frame(elev, xy=TRUE)
hii_pl = as.data.frame(hii, xy=TRUE)
hii_pl$hii_v2geo = as.numeric(hii_pl$hii_v2geo)



c_map + geom_tile(data = elev_pl, aes(x=x, y=y, fill=dem_global_30sec)) + 
    scale_fill_viridis_c() + labs(fill = "Elevation (m)")

c_map + geom_tile(data = hii_pl, aes(x=x, y=y, fill=hii_v2geo)) + 
    scale_fill_scico(palette = "bamako") + labs(fill="Human Influence Index")

Reducing the resolution

The default dataset is quite large; here I provide some code to reduce the resolution for you. I do not run this by default here, but you may be interested in running this code if your computer struggles with the analyses.

## make a template raster
r = rast(ext(clim_pres), resolution = 0.25,crs = crs(clim_pres))

## reduce to a 0.5 degree grid
clim_pres = resample(clim_pres, r)

2.2 Map projections

A new problem is the map projection used. Previously, we have been ok to use latitude and longitude coordinates. However, our analysis now is accumulating species within grid cells; this analysis will be biased because the grid cells vary in area based on latitude (and the species area relationship tells us that larger areas have more species). Thus, we must reproject our species ranges into an equal area projection, recompute the species richness, and then reproject our predictor variables as well.

We will use the European Lambert equal area conic projection; this projection is recommended by the European Research Council as an appropriate tool for map-based statistical analysis. We use the project function, which can either operate using the spatial reference directly, or which can be given a template raster. In the case of a template raster, the output will be reprojected AND resampled so that the origin, extent, and resolution match the template.

## First set up a template raster with the right projection. We will use the coarest resolution raster
## out of everything we have, so that everything ends up lined up to the "worst" dataset
res(clim_pres)

## [1] 0.04166667 0.04166667

res(elev)

## [1] 0.008333334 0.008333334

res(hii)

## [1] 0.00833333 0.00833333

## climate is the coarsest, so we project it first
## I specify a resolution manually to preserve square pixels; 4km is about the size of the largest
## pixels in the original dataset
## for the original-scale raster
clim_pres = project(clim_pres, "+init=epsg:3035", res=4000)

## if using the coarse-resolution raster
# clim_pres = projectRaster(clim_pres, crs="+init=epsg:3035", res = 25000)

## the other rasters will use climate as the template
elev = project(elev, clim_pres)
hii = project(hii, clim_pres)

Now we also reproject species ranges, and recompute richness on the new grid.

mamm = st_transform(mamm, crs=crs(clim_pres))
mamm_r = rasterize(mamm, clim_pres[[1]], fun = "sum", background = NA)

## also transform countries for plotting
countries = st_transform(countries, crs=crs(clim_pres))
mamm_pl = as.data.frame(mamm_r, xy=TRUE)
c_map = ggplot(countries) + geom_sf(colour='black', fill='white') + xlab("") + ylab("")
c_map + geom_tile(data = mamm_pl, aes(x=x, y=y, fill=layer), alpha = 0.8) + 
    scale_fill_scico(palette="davos") + labs(fill="Mammal Species Richness")

2.3 Joining the data

The final step, joining the data, is trivial. The rasters are already aligned and in the same format, so we just use the stack function to collapse them into a single dataset, and then rasterToPoints to put them into a data.frame for analysis.

richness_df = c(richness = mamm_r, elev = elev, hii = hii, clim_pres)
richness_df = as.data.frame(richness_df, xy=TRUE)

## fix some column names
colnames(richness_df)[grep('layer', colnames(richness_df))] = 'richness'
colnames(richness_df)[grep('dem_global_30sec', colnames(richness_df))] = 'elevation'
colnames(richness_df)[grep('hii_v2geo', colnames(richness_df))] = 'hii'

## remove NAs (caused by differing raster extents in the original data)
richness_df = richness_df[complete.cases(richness_df), ]
richness_df$hii = as.numeric(richness_df$hii)
head(richness_df)

##            x       y     bio1    bio10     bio11    bio12    bio13 bio14
## 4372 5408538 4239204 4.558241 15.64333 -6.190136 657.7363 83.13522    32
## 5276 5380538 4235204 4.713551 15.84915 -5.973746 659.2147 82.23922    32
## 5277 5384538 4235204 4.694183 15.86233 -6.084814 662.7248 82.71275    32
## 5278 5388538 4235204 4.651457 15.74920 -6.042210 663.4534 82.94498    32
## 5279 5392538 4235204 4.614580 15.68862 -6.058820 665.9443 84.08150    32
## 5280 5396538 4235204 4.557180 15.66292 -6.162462 667.8897 84.00000    32
##         bio15    bio16    bio17    bio18    bio19     bio2     bio3     bio4
## 4372 32.31588 231.7216 100.1580 223.5308 127.0000 7.095840 22.47617 885.7181
## 5276 31.90178 228.4970 101.2175 214.9783 130.2766 7.313979 22.84487 883.1389
## 5277 31.96600 230.4255 101.9271 217.4255 129.9467 7.349348 22.89976 888.2964
## 5278 32.30125 231.7033 100.9576 219.3894 129.2165 7.303420 22.91375 882.3629
## 5279 32.71056 235.1630 100.8016 222.4225 128.4456 7.273415 22.87676 880.6664
## 5280 32.49564 235.9746 101.8237 223.4712 128.0000 7.286245 22.84390 884.0333
##          bio5      bio6     bio7     bio8      bio9 richness elevation hii
## 4372 21.26847 -10.30230 31.57077 14.15963 -1.820078       40  59.77922  19
## 5276 21.61501 -10.40020 32.01521 14.31686 -1.767455       40  23.85699  40
## 5277 21.64711 -10.44650 32.09361 14.32822 -1.779290       40  17.49591  31
## 5278 21.47877 -10.39480 31.87357 14.23267 -1.745999       40  26.96010  23
## 5279 21.41114 -10.38271 31.79385 14.18052 -1.742617       40  35.17223  23
## 5280 21.43798 -10.45639 31.89438 14.15342 -1.811940       40  48.03083  23

The objects created here are available on OLAT. If you aren’t able to run the code above, just download them and read them in as follows, then continue with the code below.

3. Model building

Here we begin our model building process; we will start with a GLM and a single predictor, as we have previously, this time using the Poisson family because of our count data. Let’s start by looking for a diversity gradient with precipitation.

cols = scales::hue_pal()(4)
mod1 = glm(richness ~ bio12, family = "poisson", data = richness_df)
summary(mod1)

## 
## Call:
## glm(formula = richness ~ bio12, family = "poisson", data = richness_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.892e+00  7.599e-04 5121.89   <2e-16 ***
## bio12       9.989e-06  9.509e-07   10.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 759618  on 283746  degrees of freedom
## Residual deviance: 759508  on 283745  degrees of freedom
## AIC: 2379194
## 
## Number of Fisher Scoring iterations: 4

prdat = data.frame(bio12 = seq(min(richness_df$bio12), max(richness_df$bio12), length.out=100))
prdat$richness = predict(mod1, newdata=prdat, type='response')

pts = ggplot(data = richness_df) + geom_point(aes(x=bio12, y=richness), size = 0.05) + 
    xlab("Annual Precipitation")
m1_p = pts + geom_line(data = prdat, aes(x=bio12, y=richness), size = 1.4, colour = cols[1])

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

m1_p

This model certainly seems inadequate. There is very high scatter around the line, and the relationship seems highly nonlinear. Interestingly, we see an decrease in richness with precipitation.

We could continue as before, fitting polynomials to our data, but instead perhaps it makes sense to fit a different kind of model.

3.1 Generalised Additive Models

Generalised additive models are an extension of GLMs If a typical GLM fits a linear equation:

$L(\mathbb{E}(y)) = a + b_1x_1 + b_2x_2 + ... b_kx_k$

then a GAM extends this to fitting non-linear (smooth) functions of each x-variable:

$L(\mathbb{E}(y)) = a + f_1(x_1) + f_2(x_2) + ... f_k(x_k)$

Fitting a gam in R is simple.

library(gam)

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.22-5

mod2 = gam(richness ~ s(bio12), family = "poisson", data = richness_df)
summary(mod2)

## 
## Call: gam(formula = richness ~ s(bio12), family = "poisson", data = richness_df)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6212 -0.7910  0.2464  0.9667  6.6274 
## 
## (Dispersion Parameter for poisson family taken to be 1)
## 
##     Null Deviance: 759618.3 on 283746 degrees of freedom
## Residual Deviance: 691939.2 on 283742 degrees of freedom
## AIC: 2311631 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## s(bio12)       1    301 300.897  136.38 < 2.2e-16 ***
## Residuals 283742 626023   2.206                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar Chisq    P(Chi)    
## (Intercept)                                 
## s(bio12)          3      64030 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

prdat$richness2 = predict(mod2, newdata=prdat, type='response')
m2_p = m1_p + geom_line(data = prdat, aes(x=bio12, y=richness2), size = 1.4, colour = cols[3])
m2_p

The GAM produces a much curvier fit, and one that shows a hump in richness at intermediate precipitations. It certainly seems to be a better match to the cloud of points, but there is still a lot of scatter.

Note the s function in the GAM call. This stands for “smooth.” By specifying this, we are asking for a smooth curve to be fit to a variable. GAM can also fit linear terms, by including them as we do in a GLM. Also note the df parameter; this controls how curvy the line can be; if df=1, then the fit would be linear.

3.2 Spatial effects and contingency

We notice a lot of empty space on the above plot, an in particular a lot of strange gaps. Some precipitation values are not well-represented in the data, and there are some precip-richness combinations that just don’t exist. Thus, there could be a lot of things not related to precipitation that influence how this relationship looks. In particular, because we analyse diversity at every point on the map, we must be very careful in interpreting the results. Certainly p-values are biased and not to be trusted, but even the slope and/or shape of the relationship could be influenced quite a lot by confounding factors.

Naturally, we can account for this by including every important factor in our model. However, this presents at least two problems. First, it is certainly impossible to measure all important factors controlling diversity. Second, some of the factors at this scale are certainly related directly to geography, notably biogeographic explanations for species richness. Our study area includes numerous islands, deserts, and mountain ranges that present significant barriers to the movement of species, so we must consider geography in our analysis. There are a number of ways of fitting spatial models, here we consider a simple one using the GAM smoother. We use a high number of degrees of freedom to allow for lots of geographic contingency, and we include both terms in a single smoother to approximate a spline surface rather than a curve. When we visualise the curve against precipitation, we do it for the geographic centre of our study area

mod3 = gam(richness ~ s(bio12) + s(x, df=5) + s(y, df=5) + s(x*y, df=5), family = "poisson", data = richness_df)
summary(mod3)

## 
## Call: gam(formula = richness ~ s(bio12) + s(x, df = 5) + s(y, df = 5) + 
##     s(x * y, df = 5), family = "poisson", data = richness_df)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -10.23808  -0.33074   0.07579   0.40144   2.53041 
## 
## (Dispersion Parameter for poisson family taken to be 1)
## 
##     Null Deviance: 759618.3 on 283746 degrees of freedom
## Residual Deviance: 190368.3 on 283727 degrees of freedom
## AIC: 1810090 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                      Df Sum Sq Mean Sq F value    Pr(>F)    
## s(bio12)              1  20816   20816   35124 < 2.2e-16 ***
## s(x, df = 5)          1  52267   52267   88194 < 2.2e-16 ***
## s(y, df = 5)          1  38669   38669   65250 < 2.2e-16 ***
## s(x * y, df = 5)      1 129707  129707  218863 < 2.2e-16 ***
## Residuals        283727 168148       1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                  Npar Df Npar Chisq    P(Chi)    
## (Intercept)                                      
## s(bio12)               3       6162 < 2.2e-16 ***
## s(x, df = 5)           4      26540 < 2.2e-16 ***
## s(y, df = 5)           4     235145 < 2.2e-16 ***
## s(x * y, df = 5)       4       4934 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

prdat$x = (max(richness_df$x) - min(richness_df$x)) / 2 + min(richness_df$x)
prdat$y = (max(richness_df$y) - min(richness_df$y)) / 2 + min(richness_df$y)
prdat$richness3 = predict(mod3, newdata=prdat, type='response')
m3_p = m2_p + geom_line(data = prdat, aes(x=bio12, y=richness3), size = 1.4, colour = cols[4])
m3_p

Our new curve is looking much more sensible, richness generally increases with precipitation until it levels off. This suggests the decrease we saw with ther earlier models is due to other variables. Now we will try including them.

3.2 Model selection

Now comes the task of deciding which variables to include. Fortunately, AIC is also well-defined for GAM models, so we can use this as the basis of our comparison. Here we fit a full model, considering elevation, human influence, temperature, precipitation, temperature seasonality, and precipitation seasonality. We then follow this with a series of models dropping a single parameter at a time (not including space, we always retain this), and compare them all in a table using AIC.

Note that these are big models, so it is normal for this code to take some time to run.

mod_f = gam(richness ~ s(bio1) + s(bio4) + s(bio12) + s(bio15) + s(elevation) + s(hii) + 
            s(x, df=5) + s(y, df=5) + s(x*y, df=5), family = "poisson", data = richness_df)
mod_dbio1 = update(mod_f, ~.-s(bio1))
mod_dbio4 = update(mod_f, ~.-s(bio4))
mod_dbio12 = update(mod_f, ~.-s(bio12))
mod_dbio15 = update(mod_f, ~.-s(bio15))
mod_delevation = update(mod_f, ~.-s(elevation))
mod_dhii = update(mod_f, ~.-s(hii))

aic_tab = data.frame(names = c("full", "-bio1", "-bio4", "-bio12", "-bio15", "-elevation", "-hii"), 
                     aic = c(AIC(mod_f), AIC(mod_dbio1), AIC(mod_dbio4), AIC(mod_dbio12), AIC(mod_dbio15), AIC(mod_delevation), AIC(mod_dhii)))
aic_tab$daic = aic_tab$aic - min(aic_tab$aic)
aicw_num = exp(-0.5 * aic_tab$daic)
aic_tab$weight = aicw_num / sum(aicw_num)

knitr::kable(aic_tab, digits = 3)

names	aic	daic	weight
full	1787355	0.000	1
-bio1	1789873	2518.019	0
-bio4	1795769	8414.134	0
-bio12	1790064	2709.369	0
-bio15	1789806	2450.888	0
-elevation	1788118	762.607	0
-hii	1787390	34.675	0

Here we compute AIC for each model, along with $\Delta \mathrm{AIC}$ and the Akaike weight. AIC alone is related to the log likelihood of the model, after penalising for model complexity; a lower AIC indicates a better-fitting model. $\Delta \mathrm{AIC}$ can be used to compare models within a single set. Here, the best model in the set will always have $\Delta \mathrm{AIC} = 0$ , and worse models will have increasing $\Delta \mathrm{AIC}$ . Although not hard and fast, a reasonable guideline is to consider models where $\Delta \mathrm{AIC} < 2$ as having strong support within the model set, and models with $\Delta \mathrm{AIC} < 10$ have some support (Anderson and Burnham 2004). Finally, Akaike weights can be interpreted as the probability, conditional on the set of models under evaluation, that a particular model is the best one.

Here, the full model is best, indicating that all of our variables are useful in describing species richness. Perhaps we could get away with dropping hii, but we have lots of data, so it is not really necessary. We can proceed from here to some visualisation and interpretation.

4. Interpretation

First, a last note about the Poisson assumption. At the beginning, we noted that for our Poisson model to meet its parametric assumption, the conditional mean and variance must be equal. We can check this using the ratio of the residual deviance to the fitted degrees of freedom. This is known as the dispersion parameter and for Poisson models it should be near one.

round(deviance(mod_f) / mod_f$df.residual, 3)

## [1] 0.591

Slightly less than one, so we can at least assume the dispersion of our model is appropriate.

From here, this section is deliberately short. I first provide some code for producing partial a partial response plot, but I encourage you to explore the other variables on your own. What can we learn about the pattern of mammal species richness? Are any results surprising?

We create a partial response plot for the model using dummy data, as we have done before. Note that we must set each variable to a value in this case, because we have not centred the data to mean=0. We also stick to the geographic centre for interpeting the variables. However, it might also be useful to explore the geographic pattern while holding the other variables constant. How would you do this?

pl_data = data.frame(
    bio1 = rep(mean(richness_df$bio1), 200),
    bio4 = mean(richness_df$bio4),
    bio12 = mean(richness_df$bio12),
    bio15 = mean(richness_df$bio15),
    elevation = mean(richness_df$elevation),
    hii = mean(richness_df$hii),
    x = (max(richness_df$x) - min(richness_df$x)) / 2 + min(richness_df$x),
    y = (max(richness_df$y) - min(richness_df$y)) / 2 + min(richness_df$y)
)

## set up a gradient in a single variable of interest
pl_data$elevation = seq(min(richness_df$elevation), max(richness_df$elevation), length.out = nrow(pl_data))

## generate predictions
pl_data$richness = predict(mod_f, newdata = pl_data, type='response')

## plot it
ggplot(pl_data) + geom_line(aes(x=elevation, y = richness)) + xlab("Elevation") + ylab("Mammal Species Richness")

## don't forget to set the variable back to it's original value before looking at other variables
pl_data$elevation = mean(richness_df$elevation)

Finally, here we create a map of species richness, but using only endangered species. If you re-run the analysis, but using endangered species instead of all, how do the species richness patterns change?

## produces one map per status
mamm_r_end = rasterize(mamm, clim_pres[[1]], fun = "sum", background = 0, by = "category")

## adds together near-threatened, vulnerable, endangered, and critically endangered
mamm_r_end = mamm_r_end[["NT"]] + mamm_r_end[["VU"]] + mamm_r_end[["EN"]] + mamm_r_end[["CR"]]

## make sure oceans have zero instead of NA
mamm_r_end = mask(mamm_r_end, mamm_r)

## from here, you could repeat the analysis starting from 2.3

References

Anderson, D., and K. Burnham. “Model selection and multi-model inference.” Second. NY: Springer-Verlag 63.2020 (2004): 10.

CC-BY 4.0