Fitting an SDM

First some preparation. Today’s exercise requires a few more packages, so you will want to run these lines if you don’t have the packages installed yet.

install.packages(c("remotes", "rnaturalearth", "rnaturalearthdata", "scico", "corrplot", "PresenceAbsence", "gridExtra", "reshape2"))

Now we load required packages.

library(sf)
library(terra)
library(tidyterra)
library(rnaturalearth)
library(scico)
library(corrplot)
library(ggplot2)
library(PresenceAbsence)
library(gridExtra)

Introduction

Today we will build some (relatively) simple species distribution models using the shrew dataset from the previous exercise. As we have discussed, SDMs are a popular method for using known biodiversity information (e.g., species occurrences) in combination with maps of potential niche axes and niche proxies (e.g., climate, soil, land use) to predict how likely a species is to be present in places that have not been sampled. We can break the task of fitting a model into five steps:

Conceptualisation
Data preparation
Model fitting
Model assessment
Prediction

1. Conceptualisation

We have already done some data preparation in the previous exercise, but we should take a moment to think about our model and our modelling goals. Important questions to consider include:

What are our research objectives?
Do data already exist? Do we need permission to obtain them?
If we collect our own data, what is an appropriate sampling design?
Are assumptions met? Is the species (roughly) in equilibrium with climate?
What environmental predictors are appropriate to include?

We should also consider the appropriate spatial extent for modelling. In the case of the alpine shrew, we probably don’t care about the entire globe, in which case we should crop the climate data to include our area of interest.

As a motivating example for this study, perhaps we want to know what will happen to the range size of our shrew following climate change. This clarifies our modelling approach: find the shrew range using contemporary climate, use the model to predict the present and future ranges, and compute the size of each.

2. Data preparation

We begin with the data already prepared from the previous exercise, with some important additions. Today we will start by fitting a presence-absence SDM, however our gbif-sourced data is presence-only. For the purposes of this exercise, I have augmented this dataset with pseudo absences. There is an extensive literature on how to select pseudo absences for SDMs, so before doing this for your own models, I strongly recommend reading up. Good starting references are Elith and Leathwick (2009), Franklin (2010), and especially Barbet‐Massin et al (2012).

Here we load two datasets (available from me directly): the cleaned presence-only dataset we created last time, and the augmented presence-absence dataset. Both are tables of class sf, meaning they contain attributes and spatial information.

sorex_po = readRDS("../data/sorex_po.rds")
sorex_pa = readRDS("../data/sorex_pa.rds")

From WorldClim, I have prepared climate layers for recent climate, as well as a single scenario for climate change for 2040-2060, assuming a pessimistic — RCP 8.5 — emissions scenario. Here we load a world map again, as well as the current climate, and then map shrew presence and absence on the mean annual temperature.

countries = ne_countries(scale="medium", returnclass = "sf")
clim_pres = rast("../data/bioclim_pres.grd")

# there are some problems with the natural earth geometry, so we fix them
# then we re-project the raster dataset to match countries
# then we crop the dataset
countries = st_make_valid(countries)
clim_pres = project(clim_pres, countries)
countries = st_crop(countries, clim_pres)


ggplot() + geom_spatraster(data = clim_pres, aes(fill=bio1)) + 
    scale_fill_scico(palette = "lajolla", direction = -1, na.value = "transparent") + 
    geom_sf(data = countries, colour='black', fill=NA) + 
    labs(fill="Mean Annual\nTemperature") + 
    geom_sf(data = sorex_pa, aes(colour = presence), size=0.4) + 
    scale_colour_manual(values = c( "#cab2d6", "#1f78b4"))

## <SpatRaster> resampled to 501200 cells.

2.1 Extracting the climate data

As before, we must intersect the point occurrences and the climate data. This will give us the presence-absence dataset, but with all of the climate variables. There are a few alignment issues that result in a few NAs, we remove those.

sorex_pa = cbind(extract(clim_pres, sorex_pa), sorex_pa)
# drop NAs
i = which(!is.na(sorex_pa$bio1))
sorex_pa = sorex_pa[i,]
head(sorex_pa)

##   ID     bio1    bio10       bio11 bio12 bio13 bio14    bio15 bio16 bio17 bio18
## 1  1 7.076167 14.82267 -0.08000004  1388   139    91 12.62562   390   307   355
## 2  2 8.509167 17.15933 -0.45066670  1324   178    69 33.43997   503   235   503
## 3  3 8.509167 17.15933 -0.45066670  1324   178    69 33.43997   503   235   503
## 4  4 7.102000 15.44200 -1.27866673  1470   196    83 29.16104   535   285   535
## 5  5 4.145833 12.20800 -3.70466661  1331   167    61 30.86435   474   210   474
## 6  6 4.145833 12.20800 -3.70466661  1331   167    61 30.86435   474   210   474
##   bio19     bio2     bio3     bio4   bio5   bio6   bio7       bio8        bio9
## 1   373 8.149667 33.51014 612.2963 20.472 -3.848 24.320  0.8626666 11.67000008
## 2   237 9.561666 33.58742 716.4995 23.312 -5.156 28.468 17.1593342  1.08733320
## 3   237 9.561666 33.58742 716.4995 23.312 -5.156 28.468 17.1593342  1.08733320
## 4   292 8.844666 33.10130 683.5980 21.260 -5.460 26.720 15.4419994 -0.03800005
## 5   210 9.102333 34.72053 653.9284 17.952 -8.264 26.216 12.2080002 -3.70466661
## 6   210 9.102333 34.72053 653.9284 17.952 -8.264 26.216 12.2080002 -3.70466661
##   presence                  geometry
## 1  present POINT (8.109648 48.12057)
## 2  present POINT (13.14806 47.65534)
## 3  present       POINT (13.15 47.66)
## 4  present       POINT (13.66 47.68)
## 5  present       POINT (14.89 47.08)
## 6  present       POINT (14.89 47.08)

2.2 Standardisation

For many types of models, it is a good idea to standardise (or scale) our predictor variables. This prevents numerical issues during model fitting. Generally we do this for each variable, by subtracting the mean and dividing by the standard deviation (sometimes this is called a z score). The tradeoff to doing this is a bit of complexity: you must remember to save the scaling for each variable, so that you can make predictions after model fitting. Note that the scale function exists to do this transformation automatically for you, but we do it manually in order to better understand the transformation and to save it for later.

## get predictor column numbers that have 'bio' in the name
(colnames = grep('bio', colnames(sorex_pa), value = TRUE))

##  [1] "bio1"  "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17"
## [10] "bio18" "bio19" "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8" 
## [19] "bio9"

# compute mean and standard deviation of each variable
scaling = list(
    center = sapply(colnames, function(i) mean(sorex_pa[[i]])),
    scale = sapply(colnames, function(i) sd(sorex_pa[[i]]))
)

# apply the scaling by subtracting the mean and dividing by sd
sorex_pa_sc = sorex_pa
for(v in colnames)
    sorex_pa_sc[[v]] = (sorex_pa_sc[[v]] - scaling$center[v]) /  scaling$scale[v]

# verify that all columns have mean 0 and sd 1
round(sapply(colnames, function(i) mean(sorex_pa_sc[[i]])),4)

##  bio1 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19  bio2  bio3 
##     0     0     0     0     0     0     0     0     0     0     0     0     0 
##  bio4  bio5  bio6  bio7  bio8  bio9 
##     0     0     0     0     0     0

sapply(colnames, function(i) sd(sorex_pa_sc[[i]]))

##  bio1 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19  bio2  bio3 
##     1     1     1     1     1     1     1     1     1     1     1     1     1 
##  bio4  bio5  bio6  bio7  bio8  bio9 
##     1     1     1     1     1     1

3 Model Fitting

During the conceptualisation phase, we decided on our modelling goals, now we must choose a model type and select predictor variables that allow us to achieve those goals. We will start by fitting a generalised linear model. GLMs are a common and flexible category of parametric models; simple linear regression, multiple regression, logistic regression, and analysis of variance are all special cases of the GLM. Two important concepts to be aware of for GLMs are:

The liklihood distribution (called the family in R), determined by the statistical distribution producing our observations (in this case, either presence or absence).
The link function, which maps a linear regression equation onto a (often nonlinear) function that makes sense for the parameters of the likihood distribution.

Mathematically, we can think about a parameter of our distribution \(\theta\), which we fit to a linear model using the link function \(L\):

\(\theta = L(a + bx)\)

Often \(\theta\) tells us something about the average or expectation of our y-variable. In our case, we have binary outcomes: the y-variable in our model can either be present or absent. We choose a binomial likelihood to match these outcomes, and a logistic link function that will turn the linear regression equation into a sigmoid describing the probability that the species is present.

x = seq(-1,1, length.out=100) ## hypothetical predictor
theta = plogis(0 + 5 * x) ## hypothetical regression equation with an intercept of 0 and slope of 5
plot(x, theta, type='l')

3.1 Simple GLM

Let’s start with a model using only one predictor, the mean annual temperature.

sorex_pa_sc$presence = factor(sorex_pa_sc$presence) ## convert the variable into something glm understands
## present == 1, absent == 0
mod1 = glm(presence ~ bio1, data = sorex_pa_sc, family='binomial') ## the logit link is the default for binomial
summary(mod1)

## 
## Call:
## glm(formula = presence ~ bio1, family = "binomial", data = sorex_pa_sc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4422     0.1352  -18.06   <2e-16 ***
## bio1         -2.5074     0.1652  -15.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1296.57  on 1277  degrees of freedom
## Residual deviance:  789.19  on 1276  degrees of freedom
## AIC: 793.19
## 
## Number of Fisher Scoring iterations: 6

We can see there is a strong negative linear relationship between temperature and the presence/absence of the shrew. What does it look like? We can make a figure.

plot_data = data.frame(bio1 = seq(-4, 3, length.out = 100)) ## create a dummy x-variable for the plot
## create a second variable with the scaling removed so we can see the plot on the original scale
plot_data$bio1_orig = (plot_data$bio1 * scaling$scale['bio1']) + scaling$center['bio1']

## predict can produce predictions for a model automatically
plot_data$prob = predict(mod1, newdata = plot_data, type='response')

plot(plot_data$bio1_orig, plot_data$prob, type='l', xlab = "Mean Annual Temperature (°C)", ylab = "probability of presence")

## here we do it manually so you see what predict is doing
B = coef(mod1) ## get the intercept and slope
plot_data$lp = B[1] + B[2] * plot_data$bio1 ## compute the linear predictor
plot_data$prob_manual = plogis(plot_data$lp)
all.equal(plot_data$prob, plot_data$prob_manual)

## [1] TRUE

The plot seems plausible. However, many species’ niches have a hump shaped relationship with climate. How could we fit that?

mod2 = glm(presence ~ bio1 + I(bio1^2), data = sorex_pa_sc, family='binomial') ## the logit link is the default for binomial
summary(mod2)

## 
## Call:
## glm(formula = presence ~ bio1 + I(bio1^2), family = "binomial", 
##     data = sorex_pa_sc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5543     0.1599 -15.976   <2e-16 ***
## bio1         -3.1854     0.3555  -8.959   <2e-16 ***
## I(bio1^2)    -0.4574     0.1883  -2.429   0.0151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1296.57  on 1277  degrees of freedom
## Residual deviance:  783.28  on 1275  degrees of freedom
## AIC: 789.28
## 
## Number of Fisher Scoring iterations: 8

plot_data$prob2 = predict(mod2, newdata = plot_data, type='response')

par(mfrow = c(1,2))
cols = scales::hue_pal()(2)
ggplot(reshape2::melt(plot_data, measure.vars = c('prob', 'prob2')), aes(x = bio1_orig, y = value, colour=variable)) + geom_line() + xlab("Mean Annual Temperature (°C)")

We can see that the model with a quadratic term is significant, but it hasn’t changed the shape of the curve much (at least not for this temperature range); the biggest difference is that at very cold temperatures suitability starts to decline again.

3.2 Collinearity and variable selection

Now we would like to fit a model with multiple variables. However, many environmental data sets have the problem that the predictors are highly correlated with one another. We always must investigate this, and take steps to avoid including highly correlated predictors in our models. We use the corrplot package to visualise the correlations.

## first we create a correlation matrix; use spearman for safety
## we first extract just the predictors and then drop the geometry
i = which(colnames(sorex_pa_sc) %in% c("presence", "ID", "geometry"))
predictors = sorex_pa_sc[, -i]

cor_mat = cor(predictors, method='spearman', use = "complete.obs")
corrplot.mixed(cor_mat, tl.pos='lt', tl.cex=0.6, number.cex=0.5, addCoefasPercent=T)

We can see many highly correlated variables. A common threshold for avoiding multicollinearity in species distribution models is to only include variables with an absolute correlation < 0.7; more conservatively, we could use 0.5. An alternative method is to fit models using variance inflation factors (see your favorite statistics text for this). We will take the first approach, and we will choose among correlated predictors based on modelling goals and some a-priori assumptions about what makes most sense for the organism we are modelling.

The list of bioclim variables is:

bio1 = Annual Mean Temperature
bio2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
bio3 = Isothermality (BIO2/BIO7) (* 100)
bio4 = Temperature Seasonality (standard deviation *100)
bio5 = Max Temperature of Warmest Month
bio6 = Min Temperature of Coldest Month
bio7 = Temperature Annual Range (BIO5-BIO6)
bio8 = Mean Temperature of Wettest Quarter
bio9 = Mean Temperature of Driest Quarter
bio10 = Mean Temperature of Warmest Quarter
bio11 = Mean Temperature of Coldest Quarter
bio12 = Annual Precipitation
bio13 = Precipitation of Wettest Month
bio14 = Precipitation of Driest Month
bio15 = Precipitation Seasonality (Coefficient of Variation)
bio16 = Precipitation of Wettest Quarter
bio17 = Precipitation of Driest Quarter
bio18 = Precipitation of Warmest Quarter
bio19 = Precipitation of Coldest Quarter

First, we start with bio1, the mean annual temperature. This is a nice measure of the overall warmness of the climate. We can eliminate any variable with a absolute correlation >= 0.5:

thresh = 0.5
chosen = 'bio1'
## what is left
(remaining = names(which(abs(cor_mat['bio1',]) < thresh)))

## [1] "bio12" "bio13" "bio15" "bio16" "bio19" "bio2"  "bio4"  "bio7"  "bio8"

We would like a precipitation measure as well; I propose annual precipitation as being most relevant for the shrew.

chosen = c(chosen, 'bio12')
(remaining = names(which(abs(cor_mat['bio12',remaining]) < thresh)))

## [1] "bio15" "bio2"  "bio4"

Now that we have estimates of location (i.e., means or totals), we could also examine climatic variability. For temperature, we choose bio4, seasonality, for precipitation we go with bio15. We can also include bio2, as it is not strongly correlated and measures a different aspect of variability.

chosen = c(chosen, 'bio4')
(remaining = names(which(abs(cor_mat['bio4',remaining]) < thresh)))

## [1] "bio15" "bio2"

chosen = c(chosen, 'bio15')
(remaining = names(which(abs(cor_mat['bio15',remaining]) < thresh)))

## [1] "bio2"

(chosen = c(chosen, 'bio2'))

## [1] "bio1"  "bio12" "bio4"  "bio15" "bio2"

Note that there are other approaches; if you don’t care about interpreting the variables, you can quantitatively assess variable importance using AIC with univariate models (Dormann et al. 2013). However, in most cases, it’s a good idea to select variables that make sense or that test hypotheses that you as a modeller care about.

3.3 Fitting a full model

Now we have a set of predictors, and we can fit a model with all of them. Because many niches are hump-shaped, we also include quadratic terms for all variables in the full model. Later, we will try dropping variables to find what is known as a minimal adequate model.

full_mod = glm(presence ~ bio1 + bio2 + bio4 + bio12 + bio15 + 
                I(bio1^2) + I(bio2^2) + I(bio4^2) + I(bio12^2) + I(bio15^2),
               data = sorex_pa_sc, family='binomial')
summary(full_mod)

## 
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio4 + bio12 + bio15 + 
##     I(bio1^2) + I(bio2^2) + I(bio4^2) + I(bio12^2) + I(bio15^2), 
##     family = "binomial", data = sorex_pa_sc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.76762    0.29108  -6.073 1.26e-09 ***
## bio1        -2.22570    0.35337  -6.298 3.01e-10 ***
## bio2         1.08655    0.27159   4.001 6.32e-05 ***
## bio4         0.05688    0.44630   0.127  0.89858    
## bio12        2.45852    0.37232   6.603 4.02e-11 ***
## bio15        0.26897    0.22309   1.206  0.22794    
## I(bio1^2)   -0.19156    0.19782  -0.968  0.33287    
## I(bio2^2)   -0.41563    0.19668  -2.113  0.03458 *  
## I(bio4^2)   -1.44826    0.51663  -2.803  0.00506 ** 
## I(bio12^2)  -0.41557    0.19296  -2.154  0.03127 *  
## I(bio15^2)  -0.20846    0.19773  -1.054  0.29175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1296.57  on 1277  degrees of freedom
## Residual deviance:  378.49  on 1267  degrees of freedom
## AIC: 400.49
## 
## Number of Fisher Scoring iterations: 9

3.4 Reducing the model

We can see that many variables have significant relationships, but many do not. Moreover, models with many variables are often quite flexible, and we do not want to over-predict. We can use the step function to iteratively reduce our model, using AIC as the criterion for selection (i.e., if dropping a term reduces AIC, then the term will be dropped).

reduced_mod <- step(full_mod, trace=0)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(reduced_mod)

## 
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio12 + I(bio2^2) + I(bio4^2) + 
##     I(bio12^2) + I(bio15^2), family = "binomial", data = sorex_pa_sc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6831     0.2703  -6.227 4.75e-10 ***
## bio1         -2.0368     0.2619  -7.778 7.34e-15 ***
## bio2          1.2526     0.2475   5.061 4.17e-07 ***
## bio12         2.3649     0.3497   6.763 1.35e-11 ***
## I(bio2^2)    -0.5164     0.1809  -2.854  0.00432 ** 
## I(bio4^2)    -1.5658     0.5295  -2.957  0.00311 ** 
## I(bio12^2)   -0.3822     0.1956  -1.954  0.05072 .  
## I(bio15^2)   -0.3388     0.1773  -1.910  0.05608 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1296.57  on 1277  degrees of freedom
## Residual deviance:  381.02  on 1270  degrees of freedom
## AIC: 397.02
## 
## Number of Fisher Scoring iterations: 9

However, step is not very smart about dropping terms. In particular, step will happily eliminate linear terms while leaving the quadratic term, which is bad practise. So we must either drop the quadratic terms for bio4 and bio15, or we must add the linear terms back in. We will use the AIC difference to decide which is better; a positive number indicates an improvement.

## adding bio4 back in
AIC(reduced_mod) - AIC(update(reduced_mod, ~.+bio4))

## [1] -1.661996

## dropping the quadratic term
AIC(reduced_mod) - AIC(update(reduced_mod, ~.-I(bio4^2)))

## [1] -19.99535

reduced_mod = update(reduced_mod, ~.+bio4)

For bio4, adding the linear term makes the model slightly worse, but dropping the quadratic term is MUCH worse, so we add the linear term back.

## adding bio15 back in
AIC(reduced_mod) - AIC(update(reduced_mod, ~.+bio15))

## [1] -0.7539151

## dropping the quadratic term
AIC(reduced_mod) - AIC(update(reduced_mod, ~.-I(bio15^2)))

## [1] -0.7874426

reduced_mod = update(reduced_mod, ~.-I(bio15^2))

We see with bio15 that we could either add the lienar term or drop the quadratic one, the effect is very similar. Removing the quadratic term is slightly worse, but it also keeps the model simpler, so we do this.

Now our model looks like this. Note that this model is quite large, and contains curvilinear terms; thus it is not so easy to understand just from the parameters how the range of our shrew relates to each predictor.

summary(reduced_mod)

## 
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio12 + I(bio2^2) + I(bio4^2) + 
##     I(bio12^2) + bio4, family = "binomial", data = sorex_pa_sc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9799     0.2538  -7.802 6.11e-15 ***
## bio1         -2.1499     0.2572  -8.358  < 2e-16 ***
## bio2          1.2166     0.2658   4.578 4.70e-06 ***
## bio12         2.4436     0.3722   6.565 5.22e-11 ***
## I(bio2^2)    -0.5357     0.1821  -2.942  0.00326 ** 
## I(bio4^2)    -1.7515     0.5491  -3.190  0.00142 ** 
## I(bio12^2)   -0.3631     0.1968  -1.845  0.06507 .  
## bio4          0.4993     0.4222   1.183  0.23698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1296.57  on 1277  degrees of freedom
## Residual deviance:  383.47  on 1270  degrees of freedom
## AIC: 399.47
## 
## Number of Fisher Scoring iterations: 9

4 Model Asessment

Model assessment is a rather large topic, so I only present some selected topics here. In particular, evaluating model performance is key, but we will do very little for this exercise. Rather, see Allouche et al (2006) and Araugo et al (2005).

4.1 Response curves

Previously, we fit a response curve to a single variable, showing how the probability of presence changed as the temperature changed. In the lecture, we have also seen response surfaces with two variables. With more than two variables, visualisation becomes difficult. More commonly, we fit partial response curves; this shows how the probability of presence (or suitability, for presence-only models) changes as a single variable changes, while holding all other variables constant at their mean.

Side note: On suitability vs probability

The binomial GLM models the probability of observing the outcome (in this case, that the species is present) as a function of the predictor variables. However, when we lack true observations of absence, these probabilities cannot be reliably calibrated. Therefore, the predictions of binomial SDMs can only be properly referred to as probabilities if we have true absences, if there is a well-designed sampling protocol with good environmental and geographic coverage, if the predictor variables are restricted to an apprpriate spatial domain, and if the spatial scale of sampling and prediction are comparable and appropriate. In all other cases, model predictions are better referred to as suitabilities instead of probabilities.

## the variables in our model
vars = c("bio1", "bio2", "bio4", "bio12")
nms = c(bio1 = "Annual Mean Temperature", bio2 = "Mean Diurnal Range", bio4 = "Temperature Seasonality", 
        bio12 = "Annual Precipitation")

## create a dummy data frame for predicting
pr_data = data.frame(bio1 = rep(0, 100), bio2=0, bio4=0, bio12=0)
par(mfrow=c(2,2))
for(v in vars) {
    pr_data[[v]] = seq(-3,3, length.out=nrow(pr_data))
    theta = predict(reduced_mod, newdata=pr_data, type='response')
    x = pr_data[[v]] * scaling$scale[v] + scaling$center[v]
    plot(x, theta, type='l', xlab=nms[v], ylab = "suitability")
    pr_data[[v]] = 0
}

These figures give us a picture of the shrew that makes some sense: our mammal likes environments that are cold, have large swings in the daily temperature and are highly seasonal in terms of temperature, and have high and predictable precipitation.

4.2 Choosing a threshold

Now, we want to know how well our model predicts the observations. Different measures are available for quantifying this. A lot of these measures are threshold-dependent. You have probably realised that our model predicts a continuous response, the probability of occurrence/suitability, while our observations are binary. Many performance measures rely on comparisons like “How many presence observations does the model correctly predict as presence”. In order to answer that we first need to convert the continuous probabilities into binary predictions. Different thresholds are introduced in Liu et al. (2005). Most of these are implemented in the PresenceAbsence package in the optimal.thresholds function. Here, we will use the threshold that maximises the sum of sensitivity and specificity:

## prepare the data
thresh_data = data.frame(
    id = 1:nrow(sorex_pa_sc),
    pres = as.integer(sorex_pa_sc[['presence']])-1,
    prob = predict(reduced_mod, type='response')
)
thold = optimal.thresholds(thresh_data, opt.methods=3)[1,2]

## print a confusion matrix using our threshold
cmx(thresh_data, thold)

##          observed
## predicted   1   0
##         1 242  43
##         0  20 973

From here, we could compute sensitivity, specificity, and other measures such as the True Skill Statistic (TSS, Allouche et al 2006) to evaluate our model. The threshold will also be needed for computing range size.

5 Spatio-temporal prediction

Our last step is to make some maps. Recall that we have a climate dataset, clim_pres. We can also load in the future climate.

clim_fut = rast("../data/bioclim_2060.grd")

In order to make maps, we need to take a few steps.

Convert the raster to a data.frame
Re-scale the climate variables, as we did for the predictors.
Compute the suitability and the predicted presence-absence for present and future
Convert the results from step 3 back to a raster (optional; we skip it because ggplot uses the data frame instead)

# 1. data.frame
clim_pres_df = as.data.frame(clim_pres, xy = TRUE)
clim_fut_df = as.data.frame(clim_fut, xy = TRUE)

# 2. Rescale
for(v in names(clim_pres)) {
    clim_pres_df[[v]] = (clim_pres_df[[v]] - scaling$center[v]) / scaling$scale[v]
    clim_fut_df[[v]] = (clim_fut_df[[v]] - scaling$center[v]) / scaling$scale[v]
}

# 3. Predict
clim_pres_df$suitability = predict(reduced_mod, newdata = clim_pres_df, type='response')
clim_fut_df$suitability = predict(reduced_mod, newdata = clim_fut_df, type='response')
clim_pres_df$presence = ifelse(clim_pres_df$suitability > thold, 1, NA)
clim_fut_df$presence = ifelse(clim_fut_df$suitability > thold, 1, NA)

# 4. Convert back to raster
pr_suitability = rast(clim_pres_df[, c('x', 'y', 'suitability')], 
                      type = 'xyz', crs = crs(clim_pres))
fut_suitability = rast(clim_fut_df[, c('x', 'y', 'suitability')], 
                       type = 'xyz', crs = crs(clim_pres))
pr_presence = rast(clim_pres_df[, c('x', 'y', 'presence')], 
                   type = 'xyz', crs = crs(clim_pres))
fut_presence = rast(clim_fut_df[, c('x', 'y', 'presence')], 
                    type = 'xyz', crs = crs(clim_pres))


# 5. Plot
p1 = ggplot() + geom_spatraster(data = pr_suitability, aes(fill=suitability)) + 
    scale_fill_scico(palette = "lapaz", direction = 1, na.value = "transparent") + 
    labs(fill="Suitability") + 
    geom_sf(data = countries, colour='black', fill=NA)

## <SpatRaster> resampled to 501200 cells.

p2 = ggplot() + geom_spatraster(data =fut_suitability, aes(fill=suitability)) + 
    scale_fill_scico(palette = "lapaz", direction = 1, na.value = "transparent") + 
    labs(fill="Suitability") + 
    geom_sf(data = countries, colour='black', fill=NA)

## <SpatRaster> resampled to 501200 cells.

p3 = ggplot() + geom_sf(data = countries, colour=NA, fill="white") + 
    geom_spatraster(data = pr_presence, aes(fill=presence)) + 
    scale_fill_gradient(low = "#b41f1f", high = "#b41f1f", na.value = "transparent") + 
    ggtitle("Current Range") + guides(fill='none') + 
    geom_sf(data = countries, colour='black', fill=NA)

## <SpatRaster> resampled to 501200 cells.

p4 = ggplot() + geom_sf(data = countries, colour=NA, fill="white") + 
    geom_spatraster(data = fut_presence, aes(fill=presence)) + 
    scale_fill_gradient(low = "#b41f1f", high = "#b41f1f", na.value = "transparent") + 
    ggtitle("Current Range") + guides(fill='none') + 
    geom_sf(data = countries, colour='black', fill=NA)

## <SpatRaster> resampled to 501200 cells.

grid.arrange(p1, p2, p3, p4, nrow=2)

Note that our model includes many areas from which the shrew is likely absent (perhaps due to dispersal constraints, historical reasons, etc). So our range maps here could perhaps be better described as maps of potentially suitable habitat, and not a 100% accurate map of the current range. What would we/could we change about our approach to get a model of the actual shrew range?

5.1 Range size

We can now return to our previous function for computing range size to compare present and future range sizes. For this, we will convert our data frame predictions back into a raster.

#' Input: a raster containing a species range
#' NA cells are out of range, all others are in range
range_size = function(r, in_units = "m^2", out_units = "km^2") {
    cs = cellSize(r)
    mcs = mask(cs, r)
    vals = values(mcs)
    rs = units::set_units(sum(vals, na.rm = TRUE), in_units, mode = "standard")
    if(in_units != out_units)
        rs = units::set_units(rs, out_units, mode = "standard")
    rs
}

cat("Present:", range_size(pr_presence), "km^2\n",
    "Future:", range_size(fut_presence), "km^2\n")

## Present: 224860.4 km^2
##  Future: 105230 km^2

We predict by 2060 that more than half of the suitable habitat will be gone.

6 Bonus: other models

Feel free to experiment with other model types. A suggsetion for the presence-only version of the data would be to use the function ?maxent from the dismo package. You could also explore a gam model from the gam package, or RandomForest from the RandomForest package.

References

Allouche, Omri, Asaf Tsoar, and Ronen Kadmon. 2006. “Assessing the Accuracy of Species Distribution Models: Prevalence, Kappa and the True Skill Statistic (Tss).” Journal of Applied Ecology 43: 1223–32.

Araujo, Miguel B., Richard G. Pearson, Wilfried Thuiller, and Markus Erhard. 2005. “Validation of Species-Climate Impact Models Under Climate Change.” Global Change Biology 11: 1504–13.

Barbet‐Massin, M., Jiguet, F., Albert, C.H. and Thuiller, W. (2012), Selecting pseudo‐absences for species distribution models: how, where and how many?. Methods in Ecology and Evolution, 3: 327-338. https://doi.org/10.1111/j.2041-210X.2011.00172.x

Dormann, C. F., J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carre, J. R. Garcia Marquez, et al. 2013. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography 36: 27–46.

Elith, J., and J. R. Leathwick. 2009. “Species Distribution Models: Ecological Explanation and Prediction Across Space and Time.” Annual Review of Ecology, Evolution, and Systematics 40: 677–97.

Franklin, J. 2010. Mapping Species Distributions: Spatial Inference and Prediction. Cambride University Press.

Liu, C., P. M. Berry, T. P. Dawson, and R. G. Pearson. 2005. “Selecting Thresholds of Occurrence in the Prediction of Species Distributions.” Ecography 28: 385–93.

Modified with thanks from Damaris Zurell, CC-BY 4.0