library(sf)
library(terra)
library(ggplot2)
library(rnaturalearth)
library(scico)
library(gridExtra)
Today we will be exploring biodiversity using the IUCN rangemaps that we briefly explored in the first exercise. Note that this dataset includes all species for which the IUCN performs an assessment, and that, being expert-derived rangemaps, the resolution is rather coarse. Thus, our interpretations must be made with care, as they will necessarily represent approximate or potential diversity, rather than diversity as measured during (e.g.) a biodiversity inventory.
First let’s load and visualise the response variable. The mammal
range maps, cropped to the same extent we have had before, are in the
mammal_ranges.rds
file. We can use the
fasterize
function to convert these individual range
polygons into a map of species richness (by counting the number of
polygons overlapping each grid cell). We use the climate data as the
template for our raster.
mamm = readRDS("../data/biodiv/mammal_ranges.rds")
clim_pres = rast("../data/bioclim_pres.grd")
mamm_r = rasterize(mamm, clim_pres[[1]], fun = "sum", background = NA)
# fix a topological error with the countries
countries = st_make_valid(ne_countries(scale="medium", returnclass = "sf"))
# then crop countries to the data range
countries = st_crop(countries, mamm_r)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
mamm_pl = as.data.frame(mamm_r, xy=TRUE)
c_map = ggplot(countries) + geom_sf(colour='black', fill='white') + xlab("") + ylab("")
c_map + geom_tile(data = mamm_pl, aes(x=x, y=y, fill=layer), alpha = 0.8) +
scale_fill_scico(palette="davos") + labs(fill="Mammal Species Richness")
Some patterns in species richness should be immediately apparent, especially the gradient from lowlands to mountains.
Here I will propose a few modelling questions that we can pursue given the data we have with us for this exercise. However, other models are also possible; consider other hypotheses and pursue them following your own interests.
Some starter questions:
We have several distributions we could try for modelling species richness. Two that are commonly used are the Poisson and the lognormal.
Poisson models are useful for count data, especially with small counts (lower than 50). About 50% of our raster cells have fewer than 50 species, so this model is sensible. However, it comes with some assumptions. In particular, we must assume that species richness is conditionally Poisson distributed; this means that, if we hold all of the predictors in our model constant, we get Poisson distributed species richness (meaning that the variance in species richness is equal to the mean). We can check the marginal distribution to start:
sr = mamm_pl[,3]
xx = 0:max(sr)
yy = dpois(xx, mean(sr))
hist(sr, main = "", xlab = "Species Richness", freq=FALSE, ylim=c(0, max(yy)))
lines(density(sr, bw=5), lwd=2, col='red')
lines(xx, yy, lwd=2, col='blue')
legend("topleft", legend=c("empirical", "theoretical",
paste("mean =", round(mean(sr), 2)), paste("var =", round(var(sr), 2))),
lwd=c(2,2,0,0), col=c('red', 'blue', 'white', 'white'))
So marginal species richness is not poisson distributed; the variance is too high and there are too many low-richness sites. Perhaps this will get better with other predictors, but we can at least check the lognormal as well to see if it fits best before adding predictors.
yy = dlnorm(xx, mean(log(sr)), sd(log(sr)))
hist(sr, main = "", xlab = "Species Richness", freq=FALSE)
lines(density(sr, bw=5), lwd=2, col='red')
lines(xx, yy, lwd=2, col='blue')
legend("topleft", legend=c("empirical", "theoretical",
paste("mean =", round(mean(sr), 2)), paste("var =", round(var(sr), 2))),
lwd=c(2,2,0,0), col=c('red', 'blue', 'white', 'white'))
The lognormal appears to be an even worse fit, with too few high-diversity areas and too many low-diversity sites. So we will stick with the Poisson, keeping in mind we should be on the lookout for overdispersion (excess variance in the residuals), indicating that our model assumptions might not be met.
Our main task here is to load in all of the data sources we would like to use, get them into a unified format (same projection, extent and resolution), and then transform them into a data.frame for further analysis.
In addition to the mammal range maps and the climate data from previous exercises, we now have a digital elevation model from the Copernicus land monitoring service and the human influence index.
If you are interested in more “raw” land cover data, you could also check out the Corine Land Cover dataset. Note that the Corine data is quite large (100-m resolution for all Europe), and can be a bit difficult for slower computers.
elev = rast("../data/biodiv/elevation.grd")
hii = rast("../data/biodiv/hii.grd")
elev_pl = as.data.frame(elev, xy=TRUE)
hii_pl = as.data.frame(hii, xy=TRUE)
hii_pl$hii_v2geo = as.numeric(hii_pl$hii_v2geo)
c_map + geom_tile(data = elev_pl, aes(x=x, y=y, fill=dem_global_30sec)) +
scale_fill_viridis_c() + labs(fill = "Elevation (m)")
c_map + geom_tile(data = hii_pl, aes(x=x, y=y, fill=hii_v2geo)) +
scale_fill_scico(palette = "bamako") + labs(fill="Human Influence Index")
The default dataset is quite large; here I provide some code to reduce the resolution for you. I do not run this by default here, but you may be interested in running this code if your computer struggles with the analyses.
## make a template raster
r = rast(ext(clim_pres), resolution = 0.25,crs = crs(clim_pres))
## reduce to a 0.5 degree grid
clim_pres = resample(clim_pres, r)
A new problem is the map projection used. Previously, we have been ok to use latitude and longitude coordinates. However, our analysis now is accumulating species within grid cells; this analysis will be biased because the grid cells vary in area based on latitude (and the species area relationship tells us that larger areas have more species). Thus, we must reproject our species ranges into an equal area projection, recompute the species richness, and then reproject our predictor variables as well.
We will use the European Lambert equal
area conic projection; this projection is recommended by the European
Research Council as an appropriate tool for map-based statistical
analysis. We use the project
function, which can either
operate using the spatial reference directly, or which can be given a
template raster. In the case of a template raster, the output will be
reprojected AND resampled so that the origin, extent, and resolution
match the template.
## First set up a template raster with the right projection. We will use the coarest resolution raster
## out of everything we have, so that everything ends up lined up to the "worst" dataset
res(clim_pres)
## [1] 0.04166667 0.04166667
res(elev)
## [1] 0.008333334 0.008333334
res(hii)
## [1] 0.00833333 0.00833333
## climate is the coarsest, so we project it first
## I specify a resolution manually to preserve square pixels; 4km is about the size of the largest
## pixels in the original dataset
## for the original-scale raster
clim_pres = project(clim_pres, "+init=epsg:3035", res=4000)
## if using the coarse-resolution raster
# clim_pres = projectRaster(clim_pres, crs="+init=epsg:3035", res = 25000)
## the other rasters will use climate as the template
elev = project(elev, clim_pres)
hii = project(hii, clim_pres)
Now we also reproject species ranges, and recompute richness on the new grid.
mamm = st_transform(mamm, crs=crs(clim_pres))
mamm_r = rasterize(mamm, clim_pres[[1]], fun = "sum", background = NA)
## also transform countries for plotting
countries = st_transform(countries, crs=crs(clim_pres))
mamm_pl = as.data.frame(mamm_r, xy=TRUE)
c_map = ggplot(countries) + geom_sf(colour='black', fill='white') + xlab("") + ylab("")
c_map + geom_tile(data = mamm_pl, aes(x=x, y=y, fill=layer), alpha = 0.8) +
scale_fill_scico(palette="davos") + labs(fill="Mammal Species Richness")
The final step, joining the data, is trivial. The rasters are already
aligned and in the same format, so we just use the stack
function to collapse them into a single dataset, and then
rasterToPoints
to put them into a data.frame for
analysis.
richness_df = c(richness = mamm_r, elev = elev, hii = hii, clim_pres)
richness_df = as.data.frame(richness_df, xy=TRUE)
## fix some column names
colnames(richness_df)[grep('layer', colnames(richness_df))] = 'richness'
colnames(richness_df)[grep('dem_global_30sec', colnames(richness_df))] = 'elevation'
colnames(richness_df)[grep('hii_v2geo', colnames(richness_df))] = 'hii'
## remove NAs (caused by differing raster extents in the original data)
richness_df = richness_df[complete.cases(richness_df), ]
richness_df$hii = as.numeric(richness_df$hii)
head(richness_df)
## x y bio1 bio10 bio11 bio12 bio13 bio14
## 4372 5408538 4239204 4.558241 15.64333 -6.190136 657.7363 83.13522 32
## 5276 5380538 4235204 4.713551 15.84915 -5.973746 659.2147 82.23922 32
## 5277 5384538 4235204 4.694183 15.86233 -6.084814 662.7248 82.71275 32
## 5278 5388538 4235204 4.651457 15.74920 -6.042210 663.4534 82.94498 32
## 5279 5392538 4235204 4.614580 15.68862 -6.058820 665.9443 84.08150 32
## 5280 5396538 4235204 4.557180 15.66292 -6.162462 667.8897 84.00000 32
## bio15 bio16 bio17 bio18 bio19 bio2 bio3 bio4
## 4372 32.31588 231.7216 100.1580 223.5308 127.0000 7.095840 22.47617 885.7181
## 5276 31.90178 228.4970 101.2175 214.9783 130.2766 7.313979 22.84487 883.1389
## 5277 31.96600 230.4255 101.9271 217.4255 129.9467 7.349348 22.89976 888.2964
## 5278 32.30125 231.7033 100.9576 219.3894 129.2165 7.303420 22.91375 882.3629
## 5279 32.71056 235.1630 100.8016 222.4225 128.4456 7.273415 22.87676 880.6664
## 5280 32.49564 235.9746 101.8237 223.4712 128.0000 7.286245 22.84390 884.0333
## bio5 bio6 bio7 bio8 bio9 richness elevation hii
## 4372 21.26847 -10.30230 31.57077 14.15963 -1.820078 40 59.77922 19
## 5276 21.61501 -10.40020 32.01521 14.31686 -1.767455 40 23.85699 40
## 5277 21.64711 -10.44650 32.09361 14.32822 -1.779290 40 17.49591 31
## 5278 21.47877 -10.39480 31.87357 14.23267 -1.745999 40 26.96010 23
## 5279 21.41114 -10.38271 31.79385 14.18052 -1.742617 40 35.17223 23
## 5280 21.43798 -10.45639 31.89438 14.15342 -1.811940 40 48.03083 23
The objects created here are available on OLAT. If you aren’t able to run the code above, just download them and read them in as follows, then continue with the code below.
Here we begin our model building process; we will start with a GLM and a single predictor, as we have previously, this time using the Poisson family because of our count data. Let’s start by looking for a diversity gradient with precipitation.
cols = scales::hue_pal()(4)
mod1 = glm(richness ~ bio12, family = "poisson", data = richness_df)
summary(mod1)
##
## Call:
## glm(formula = richness ~ bio12, family = "poisson", data = richness_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.892e+00 7.599e-04 5121.89 <2e-16 ***
## bio12 9.989e-06 9.509e-07 10.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 759618 on 283746 degrees of freedom
## Residual deviance: 759508 on 283745 degrees of freedom
## AIC: 2379194
##
## Number of Fisher Scoring iterations: 4
prdat = data.frame(bio12 = seq(min(richness_df$bio12), max(richness_df$bio12), length.out=100))
prdat$richness = predict(mod1, newdata=prdat, type='response')
pts = ggplot(data = richness_df) + geom_point(aes(x=bio12, y=richness), size = 0.05) +
xlab("Annual Precipitation")
m1_p = pts + geom_line(data = prdat, aes(x=bio12, y=richness), size = 1.4, colour = cols[1])
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
m1_p
This model certainly seems inadequate. There is very high scatter around the line, and the relationship seems highly nonlinear. Interestingly, we see an decrease in richness with precipitation.
We could continue as before, fitting polynomials to our data, but instead perhaps it makes sense to fit a different kind of model.
Generalised additive models are an extension of GLMs If a typical GLM fits a linear equation:
then a GAM extends this to fitting non-linear (smooth) functions of each x-variable:
Fitting a gam in R is simple.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
mod2 = gam(richness ~ s(bio12), family = "poisson", data = richness_df)
summary(mod2)
##
## Call: gam(formula = richness ~ s(bio12), family = "poisson", data = richness_df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.6212 -0.7910 0.2464 0.9667 6.6274
##
## (Dispersion Parameter for poisson family taken to be 1)
##
## Null Deviance: 759618.3 on 283746 degrees of freedom
## Residual Deviance: 691939.2 on 283742 degrees of freedom
## AIC: 2311631
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(bio12) 1 301 300.897 136.38 < 2.2e-16 ***
## Residuals 283742 626023 2.206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar Chisq P(Chi)
## (Intercept)
## s(bio12) 3 64030 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
prdat$richness2 = predict(mod2, newdata=prdat, type='response')
m2_p = m1_p + geom_line(data = prdat, aes(x=bio12, y=richness2), size = 1.4, colour = cols[3])
m2_p
The GAM produces a much curvier fit, and one that shows a hump in richness at intermediate precipitations. It certainly seems to be a better match to the cloud of points, but there is still a lot of scatter.
Note the s
function in the GAM call. This stands for
“smooth.” By specifying this, we are asking for a smooth curve to be fit
to a variable. GAM can also fit linear terms, by including them as we do
in a GLM. Also note the df
parameter; this controls how
curvy the line can be; if df=1, then the fit would be linear.
We notice a lot of empty space on the above plot, an in particular a lot of strange gaps. Some precipitation values are not well-represented in the data, and there are some precip-richness combinations that just don’t exist. Thus, there could be a lot of things not related to precipitation that influence how this relationship looks. In particular, because we analyse diversity at every point on the map, we must be very careful in interpreting the results. Certainly p-values are biased and not to be trusted, but even the slope and/or shape of the relationship could be influenced quite a lot by confounding factors.
Naturally, we can account for this by including every important factor in our model. However, this presents at least two problems. First, it is certainly impossible to measure all important factors controlling diversity. Second, some of the factors at this scale are certainly related directly to geography, notably biogeographic explanations for species richness. Our study area includes numerous islands, deserts, and mountain ranges that present significant barriers to the movement of species, so we must consider geography in our analysis. There are a number of ways of fitting spatial models, here we consider a simple one using the GAM smoother. We use a high number of degrees of freedom to allow for lots of geographic contingency, and we include both terms in a single smoother to approximate a spline surface rather than a curve. When we visualise the curve against precipitation, we do it for the geographic centre of our study area
mod3 = gam(richness ~ s(bio12) + s(x, df=5) + s(y, df=5) + s(x*y, df=5), family = "poisson", data = richness_df)
summary(mod3)
##
## Call: gam(formula = richness ~ s(bio12) + s(x, df = 5) + s(y, df = 5) +
## s(x * y, df = 5), family = "poisson", data = richness_df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.23808 -0.33074 0.07579 0.40144 2.53041
##
## (Dispersion Parameter for poisson family taken to be 1)
##
## Null Deviance: 759618.3 on 283746 degrees of freedom
## Residual Deviance: 190368.3 on 283727 degrees of freedom
## AIC: 1810090
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(bio12) 1 20816 20816 35124 < 2.2e-16 ***
## s(x, df = 5) 1 52267 52267 88194 < 2.2e-16 ***
## s(y, df = 5) 1 38669 38669 65250 < 2.2e-16 ***
## s(x * y, df = 5) 1 129707 129707 218863 < 2.2e-16 ***
## Residuals 283727 168148 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar Chisq P(Chi)
## (Intercept)
## s(bio12) 3 6162 < 2.2e-16 ***
## s(x, df = 5) 4 26540 < 2.2e-16 ***
## s(y, df = 5) 4 235145 < 2.2e-16 ***
## s(x * y, df = 5) 4 4934 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
prdat$x = (max(richness_df$x) - min(richness_df$x)) / 2 + min(richness_df$x)
prdat$y = (max(richness_df$y) - min(richness_df$y)) / 2 + min(richness_df$y)
prdat$richness3 = predict(mod3, newdata=prdat, type='response')
m3_p = m2_p + geom_line(data = prdat, aes(x=bio12, y=richness3), size = 1.4, colour = cols[4])
m3_p
Our new curve is looking much more sensible, richness generally increases with precipitation until it levels off. This suggests the decrease we saw with ther earlier models is due to other variables. Now we will try including them.
Now comes the task of deciding which variables to include. Fortunately, AIC is also well-defined for GAM models, so we can use this as the basis of our comparison. Here we fit a full model, considering elevation, human influence, temperature, precipitation, temperature seasonality, and precipitation seasonality. We then follow this with a series of models dropping a single parameter at a time (not including space, we always retain this), and compare them all in a table using AIC.
Note that these are big models, so it is normal for this code to take some time to run.
mod_f = gam(richness ~ s(bio1) + s(bio4) + s(bio12) + s(bio15) + s(elevation) + s(hii) +
s(x, df=5) + s(y, df=5) + s(x*y, df=5), family = "poisson", data = richness_df)
mod_dbio1 = update(mod_f, ~.-s(bio1))
mod_dbio4 = update(mod_f, ~.-s(bio4))
mod_dbio12 = update(mod_f, ~.-s(bio12))
mod_dbio15 = update(mod_f, ~.-s(bio15))
mod_delevation = update(mod_f, ~.-s(elevation))
mod_dhii = update(mod_f, ~.-s(hii))
aic_tab = data.frame(names = c("full", "-bio1", "-bio4", "-bio12", "-bio15", "-elevation", "-hii"),
aic = c(AIC(mod_f), AIC(mod_dbio1), AIC(mod_dbio4), AIC(mod_dbio12), AIC(mod_dbio15), AIC(mod_delevation), AIC(mod_dhii)))
aic_tab$daic = aic_tab$aic - min(aic_tab$aic)
aicw_num = exp(-0.5 * aic_tab$daic)
aic_tab$weight = aicw_num / sum(aicw_num)
knitr::kable(aic_tab, digits = 3)
names | aic | daic | weight |
---|---|---|---|
full | 1787355 | 0.000 | 1 |
-bio1 | 1789873 | 2518.019 | 0 |
-bio4 | 1795769 | 8414.134 | 0 |
-bio12 | 1790064 | 2709.369 | 0 |
-bio15 | 1789806 | 2450.888 | 0 |
-elevation | 1788118 | 762.607 | 0 |
-hii | 1787390 | 34.675 | 0 |
Here we compute AIC for each model, along with and the Akaike weight. AIC alone is related to the log likelihood of the model, after penalising for model complexity; a lower AIC indicates a better-fitting model. can be used to compare models within a single set. Here, the best model in the set will always have , and worse models will have increasing . Although not hard and fast, a reasonable guideline is to consider models where as having strong support within the model set, and models with have some support (Anderson and Burnham 2004). Finally, Akaike weights can be interpreted as the probability, conditional on the set of models under evaluation, that a particular model is the best one.
Here, the full model is best, indicating that all of our variables
are useful in describing species richness. Perhaps we could get away
with dropping hii
, but we have lots of data, so it is not
really necessary. We can proceed from here to some visualisation and
interpretation.
First, a last note about the Poisson assumption. At the beginning, we noted that for our Poisson model to meet its parametric assumption, the conditional mean and variance must be equal. We can check this using the ratio of the residual deviance to the fitted degrees of freedom. This is known as the dispersion parameter and for Poisson models it should be near one.
round(deviance(mod_f) / mod_f$df.residual, 3)
## [1] 0.591
Slightly less than one, so we can at least assume the dispersion of our model is appropriate.
From here, this section is deliberately short. I first provide some code for producing partial a partial response plot, but I encourage you to explore the other variables on your own. What can we learn about the pattern of mammal species richness? Are any results surprising?
We create a partial response plot for the model using dummy data, as we have done before. Note that we must set each variable to a value in this case, because we have not centred the data to mean=0. We also stick to the geographic centre for interpeting the variables. However, it might also be useful to explore the geographic pattern while holding the other variables constant. How would you do this?
pl_data = data.frame(
bio1 = rep(mean(richness_df$bio1), 200),
bio4 = mean(richness_df$bio4),
bio12 = mean(richness_df$bio12),
bio15 = mean(richness_df$bio15),
elevation = mean(richness_df$elevation),
hii = mean(richness_df$hii),
x = (max(richness_df$x) - min(richness_df$x)) / 2 + min(richness_df$x),
y = (max(richness_df$y) - min(richness_df$y)) / 2 + min(richness_df$y)
)
## set up a gradient in a single variable of interest
pl_data$elevation = seq(min(richness_df$elevation), max(richness_df$elevation), length.out = nrow(pl_data))
## generate predictions
pl_data$richness = predict(mod_f, newdata = pl_data, type='response')
## plot it
ggplot(pl_data) + geom_line(aes(x=elevation, y = richness)) + xlab("Elevation") + ylab("Mammal Species Richness")
## don't forget to set the variable back to it's original value before looking at other variables
pl_data$elevation = mean(richness_df$elevation)
Finally, here we create a map of species richness, but using only endangered species. If you re-run the analysis, but using endangered species instead of all, how do the species richness patterns change?
## produces one map per status
mamm_r_end = rasterize(mamm, clim_pres[[1]], fun = "sum", background = 0, by = "category")
## adds together near-threatened, vulnerable, endangered, and critically endangered
mamm_r_end = mamm_r_end[["NT"]] + mamm_r_end[["VU"]] + mamm_r_end[["EN"]] + mamm_r_end[["CR"]]
## make sure oceans have zero instead of NA
mamm_r_end = mask(mamm_r_end, mamm_r)
## from here, you could repeat the analysis starting from 2.3
Anderson, D., and K. Burnham. “Model selection and multi-model inference.” Second. NY: Springer-Verlag 63.2020 (2004): 10.