First some preparation. Today's exercise requires a few more packages, so run the following if you don't have them installed yet.
install.packages(c("remotes", "rnaturalearth", "rnaturalearthdata", "scico", "corrplot", "PresenceAbsence", "gridExtra", "reshape2"))
Now we load required packages.
library(sf)
library(terra)
library(tidyterra)
library(rnaturalearth)
library(scico)
library(corrplot)
library(ggplot2)
library(PresenceAbsence)
library(gridExtra)
Today we will build some (relatively) simple species distribution models using the shrew dataset from the previous exercise. As we have discussed, SDMs are a popular method for using known biodiversity information (e.g., species occurrences) in combination with maps of potential niche axes and niche proxies (e.g., climate, soil, land use) to predict how likely a species is to be present in places that have not been sampled. We can break the task of fitting a model into five steps: (1) conceptualisation, (2) data preparation, (3) model fitting, (4) model assessment, and (5) prediction.
We have already done some data preparation in the previous exercise, but we should take a moment to think about our model and our modelling goals. Important questions to consider include what the model will be used for, what data are available, and which predictors are ecologically relevant.
We should also consider the appropriate spatial extent for modelling. In the case of the alpine shrew, we probably don’t care about the entire globe, in which case we should crop the climate data to include our area of interest.
As a motivating example for this study, perhaps we want to know what will happen to the range size of our shrew following climate change. This clarifies our modelling approach: model the shrew's occurrence using contemporary climate, use the model to predict the present and future ranges, and compute the size of each.
We begin with the data already prepared from the previous exercise, with some important additions. Today we will start by fitting a presence-absence SDM; however, our GBIF-sourced data is presence-only. For the purposes of this exercise, I have augmented this dataset with pseudo-absences. There is an extensive literature on how to select pseudo-absences for SDMs, so before doing this for your own models, I strongly recommend reading up. Good starting references are Elith and Leathwick (2009), Franklin (2010), and especially Barbet-Massin et al. (2012).
Here we load two datasets (available from me directly): the cleaned presence-only dataset we created last time, and the augmented presence-absence dataset. Both are tables of class sf, meaning they contain attributes and spatial information.
sorex_po = readRDS("../data/sorex_po.rds")
sorex_pa = readRDS("../data/sorex_pa.rds")
From WorldClim, I have prepared climate layers for recent climate, as well as a single climate-change scenario for 2040-2060 under a pessimistic emissions scenario (RCP 8.5). Here we load a world map again, as well as the current climate, and then map shrew presence and absence over mean annual temperature.
countries = ne_countries(scale="medium", returnclass = "sf")
clim_pres = rast("../data/bioclim_pres.grd")
# there are some problems with the natural earth geometry, so we fix them
# then we re-project the raster dataset to match countries
# then we crop the dataset
countries = st_make_valid(countries)
clim_pres = project(clim_pres, countries)
countries = st_crop(countries, clim_pres)
ggplot() + geom_spatraster(data = clim_pres, aes(fill=bio1)) +
scale_fill_scico(palette = "lajolla", direction = -1, na.value = "transparent") +
geom_sf(data = countries, colour='black', fill=NA) +
labs(fill="Mean Annual\nTemperature") +
geom_sf(data = sorex_pa, aes(colour = presence), size=0.4) +
scale_colour_manual(values = c( "#cab2d6", "#1f78b4"))
## <SpatRaster> resampled to 501200 cells.
As before, we must intersect the point occurrences and the climate data. This will give us the presence-absence dataset, but with all of the climate variables. There are a few alignment issues that result in a few NAs, which we remove.
sorex_pa = cbind(extract(clim_pres, sorex_pa), sorex_pa)
# drop NAs
i = which(!is.na(sorex_pa$bio1))
sorex_pa = sorex_pa[i,]
head(sorex_pa)
## ID bio1 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18
## 1 1 7.076167 14.82267 -0.08000004 1388 139 91 12.62562 390 307 355
## 2 2 8.509167 17.15933 -0.45066670 1324 178 69 33.43997 503 235 503
## 3 3 8.509167 17.15933 -0.45066670 1324 178 69 33.43997 503 235 503
## 4 4 7.102000 15.44200 -1.27866673 1470 196 83 29.16104 535 285 535
## 5 5 4.145833 12.20800 -3.70466661 1331 167 61 30.86435 474 210 474
## 6 6 4.145833 12.20800 -3.70466661 1331 167 61 30.86435 474 210 474
## bio19 bio2 bio3 bio4 bio5 bio6 bio7 bio8 bio9
## 1 373 8.149667 33.51014 612.2963 20.472 -3.848 24.320 0.8626666 11.67000008
## 2 237 9.561666 33.58742 716.4995 23.312 -5.156 28.468 17.1593342 1.08733320
## 3 237 9.561666 33.58742 716.4995 23.312 -5.156 28.468 17.1593342 1.08733320
## 4 292 8.844666 33.10130 683.5980 21.260 -5.460 26.720 15.4419994 -0.03800005
## 5 210 9.102333 34.72053 653.9284 17.952 -8.264 26.216 12.2080002 -3.70466661
## 6 210 9.102333 34.72053 653.9284 17.952 -8.264 26.216 12.2080002 -3.70466661
## presence geometry
## 1 present POINT (8.109648 48.12057)
## 2 present POINT (13.14806 47.65534)
## 3 present POINT (13.15 47.66)
## 4 present POINT (13.66 47.68)
## 5 present POINT (14.89 47.08)
## 6 present POINT (14.89 47.08)
For many types of models, it is a good idea to standardise (or scale) our predictor variables. This prevents numerical issues during model fitting. Generally we do this for each variable by subtracting the mean and dividing by the standard deviation (sometimes called a z-score). The tradeoff is a bit of added complexity: you must remember to save the scaling for each variable so that you can make predictions after model fitting. Note that the scale function exists to do this transformation automatically, but we do it manually in order to better understand the transformation and to save it for later.
## get the predictor column names that contain 'bio'
(bio_vars = grep('bio', colnames(sorex_pa), value = TRUE))
## [1] "bio1" "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17"
## [10] "bio18" "bio19" "bio2" "bio3" "bio4" "bio5" "bio6" "bio7" "bio8"
## [19] "bio9"
# compute the mean and standard deviation of each variable
scaling = list(
  center = sapply(bio_vars, function(i) mean(sorex_pa[[i]])),
  scale = sapply(bio_vars, function(i) sd(sorex_pa[[i]]))
)
# apply the scaling by subtracting the mean and dividing by the sd
sorex_pa_sc = sorex_pa
for(v in bio_vars)
  sorex_pa_sc[[v]] = (sorex_pa_sc[[v]] - scaling$center[v]) / scaling$scale[v]
# verify that all columns have mean 0 and sd 1
round(sapply(bio_vars, function(i) mean(sorex_pa_sc[[i]])), 4)
## bio1 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19 bio2 bio3
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## bio4 bio5 bio6 bio7 bio8 bio9
## 0 0 0 0 0 0
sapply(bio_vars, function(i) sd(sorex_pa_sc[[i]]))
## bio1 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19 bio2 bio3
## 1 1 1 1 1 1 1 1 1 1 1 1 1
## bio4 bio5 bio6 bio7 bio8 bio9
## 1 1 1 1 1 1
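For comparison, the scale function in base R performs the same transformation in one step, storing the centring and scaling values as attributes of the result:

## the same z-score transformation with base R's scale();
## the mean and sd used are kept as attributes
z = scale(sorex_pa$bio1)
attr(z, "scaled:center")  ## the mean that was subtracted
attr(z, "scaled:scale")   ## the sd that was divided by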
During the conceptualisation phase we decided on our modelling goals; now we must choose a model type and select predictor variables that allow us to achieve those goals. We will start by fitting a generalised linear model (GLM). GLMs are a common and flexible category of parametric models; simple linear regression, multiple regression, logistic regression, and analysis of variance are all special cases of the GLM. Two important concepts to be aware of for GLMs are:

1. The link function, which connects the linear model to the parameter of the response distribution we want to estimate.
2. The likelihood (called the family in R), determined by the statistical distribution producing our observations (in this case, either presence or absence).

Mathematically, we can think about a parameter of our distribution \(\theta\), which we fit to a linear model using the link function \(L\):
\(\theta = L(a + bx)\)
Often \(\theta\) tells us something about the average or expectation of our y-variable. In our case, we have binary outcomes: the y-variable in our model can either be present or absent. We choose a binomial likelihood to match these outcomes, and a logistic link function that will turn the linear regression equation into a sigmoid describing the probability that the species is present.
x = seq(-1,1, length.out=100) ## hypothetical predictor
theta = plogis(0 + 5 * x) ## hypothetical regression equation with an intercept of 0 and slope of 5
plot(x, theta, type='l')
Let’s start with a model using only one predictor, the mean annual temperature.
sorex_pa_sc$presence = factor(sorex_pa_sc$presence) ## convert the variable into something glm understands
## present == 1, absent == 0
mod1 = glm(presence ~ bio1, data = sorex_pa_sc, family='binomial') ## the logit link is the default for binomial
summary(mod1)
##
## Call:
## glm(formula = presence ~ bio1, family = "binomial", data = sorex_pa_sc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4422 0.1352 -18.06 <2e-16 ***
## bio1 -2.5074 0.1652 -15.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1296.57 on 1277 degrees of freedom
## Residual deviance: 789.19 on 1276 degrees of freedom
## AIC: 793.19
##
## Number of Fisher Scoring iterations: 6
We can see there is a strong negative linear relationship between temperature and the presence/absence of the shrew. What does it look like? We can make a figure.
plot_data = data.frame(bio1 = seq(-4, 3, length.out = 100)) ## create a dummy x-variable for the plot
## create a second variable with the scaling removed so we can see the plot on the original scale
plot_data$bio1_orig = (plot_data$bio1 * scaling$scale['bio1']) + scaling$center['bio1']
## predict can produce predictions for a model automatically
plot_data$prob = predict(mod1, newdata = plot_data, type='response')
plot(plot_data$bio1_orig, plot_data$prob, type='l', xlab = "Mean Annual Temperature (°C)", ylab = "probability of presence")
## here we do it manually so you see what predict is doing
B = coef(mod1) ## get the intercept and slope
plot_data$lp = B[1] + B[2] * plot_data$bio1 ## compute the linear predictor
plot_data$prob_manual = plogis(plot_data$lp)
all.equal(plot_data$prob, plot_data$prob_manual)
## [1] TRUE
The plot seems plausible. However, many species' niches have a hump-shaped relationship with climate. How could we fit that?
mod2 = glm(presence ~ bio1 + I(bio1^2), data = sorex_pa_sc, family='binomial') ## the logit link is the default for binomial
summary(mod2)
##
## Call:
## glm(formula = presence ~ bio1 + I(bio1^2), family = "binomial",
## data = sorex_pa_sc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5543 0.1599 -15.976 <2e-16 ***
## bio1 -3.1854 0.3555 -8.959 <2e-16 ***
## I(bio1^2) -0.4574 0.1883 -2.429 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1296.57 on 1277 degrees of freedom
## Residual deviance: 783.28 on 1275 degrees of freedom
## AIC: 789.28
##
## Number of Fisher Scoring iterations: 8
plot_data$prob2 = predict(mod2, newdata = plot_data, type='response')
ggplot(reshape2::melt(plot_data, measure.vars = c('prob', 'prob2')),
       aes(x = bio1_orig, y = value, colour = variable)) +
  geom_line() + xlab("Mean Annual Temperature (°C)") + ylab("probability of presence")
We can see that the quadratic term is statistically significant, but it hasn't changed the shape of the curve much (at least not within this temperature range); the biggest difference is that at very cold temperatures suitability starts to decline again.
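We can also compare the two models directly with AIC; consistent with the model summaries above, the quadratic model is (slightly) preferred:

## lower AIC indicates more support
AIC(mod1, mod2)  ## mod2: 789.28 vs mod1: 793.19 (from the summaries above)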
Now we would like to fit a model with multiple variables. However, many environmental datasets have the problem that the predictors are highly correlated with one another. We must always investigate this, and take steps to avoid including highly correlated predictors in our models. We use the corrplot package to visualise the correlations.
## first we create a correlation matrix; use spearman for safety
## we drop the geometry (it is "sticky" in sf) and the non-predictor columns
predictors = st_drop_geometry(sorex_pa_sc)
predictors = predictors[, !colnames(predictors) %in% c("presence", "ID")]
cor_mat = cor(predictors, method='spearman', use = "complete.obs")
corrplot.mixed(cor_mat, tl.pos='lt', tl.cex=0.6, number.cex=0.5, addCoefasPercent=T)
We can see many highly correlated variables. A common threshold for avoiding multicollinearity in species distribution models is to include only variables with an absolute correlation < 0.7; more conservatively, we could use 0.5. An alternative is to screen variables using variance inflation factors (see your favourite statistics text). We will take the first approach, and we will choose among correlated predictors based on our modelling goals and some a priori assumptions about what makes the most sense for the organism we are modelling.
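If you would like to try the VIF route, here is a minimal sketch using only base R; it assumes the predictors table created above. Note that some bioclim variables are exact functions of others (e.g., bio7 = bio5 - bio6), so expect some enormous (even infinite) VIFs here:

## regress each predictor on all the others; VIF = 1 / (1 - R^2)
## rules of thumb flag VIF > 10 (or, more strictly, > 5)
vifs = sapply(colnames(predictors), function(v) {
  others = setdiff(colnames(predictors), v)
  fit = lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
sort(vifs)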
The list of bioclim variables is:

bio1: Annual Mean Temperature
bio2: Mean Diurnal Range (mean of monthly max minus min temperature)
bio3: Isothermality (bio2/bio7 x 100)
bio4: Temperature Seasonality (standard deviation x 100)
bio5: Max Temperature of Warmest Month
bio6: Min Temperature of Coldest Month
bio7: Temperature Annual Range (bio5 - bio6)
bio8: Mean Temperature of Wettest Quarter
bio9: Mean Temperature of Driest Quarter
bio10: Mean Temperature of Warmest Quarter
bio11: Mean Temperature of Coldest Quarter
bio12: Annual Precipitation
bio13: Precipitation of Wettest Month
bio14: Precipitation of Driest Month
bio15: Precipitation Seasonality (coefficient of variation)
bio16: Precipitation of Wettest Quarter
bio17: Precipitation of Driest Quarter
bio18: Precipitation of Warmest Quarter
bio19: Precipitation of Coldest Quarter
First, we start with bio1, the mean annual temperature. This is a nice measure of the overall warmth of the climate. We can eliminate any variable with an absolute correlation >= 0.5:
thresh = 0.5
chosen = 'bio1'
## what is left
(remaining = names(which(abs(cor_mat['bio1',]) < thresh)))
## [1] "bio12" "bio13" "bio15" "bio16" "bio19" "bio2" "bio4" "bio7" "bio8"
We would like a precipitation measure as well; I propose annual precipitation as being most relevant for the shrew.
chosen = c(chosen, 'bio12')
(remaining = names(which(abs(cor_mat['bio12',remaining]) < thresh)))
## [1] "bio15" "bio2" "bio4"
Now that we have estimates of location (i.e., means or totals), we could also examine climatic variability. For temperature we choose bio4 (temperature seasonality); for precipitation we go with bio15 (precipitation seasonality). We can also include bio2, as it is not strongly correlated with the others and measures a different aspect of variability.
chosen = c(chosen, 'bio4')
(remaining = names(which(abs(cor_mat['bio4',remaining]) < thresh)))
## [1] "bio15" "bio2"
chosen = c(chosen, 'bio15')
(remaining = names(which(abs(cor_mat['bio15',remaining]) < thresh)))
## [1] "bio2"
(chosen = c(chosen, 'bio2'))
## [1] "bio1" "bio12" "bio4" "bio15" "bio2"
Note that there are other approaches; if you don’t care about interpreting the variables, you can quantitatively assess variable importance using AIC with univariate models (Dormann et al. 2013). However, in most cases, it’s a good idea to select variables that make sense or that test hypotheses that you as a modeller care about.
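As a sketch of that quantitative alternative (purely illustrative; we do not use it here), univariate AIC screening might look like this, reusing the bio_vars names from earlier:

## fit one single-predictor GLM per bioclim variable and rank by AIC
## (lower AIC = more support for that variable on its own)
uni_aic = sapply(bio_vars, function(v)
  AIC(glm(reformulate(v, response = "presence"),
          data = sorex_pa_sc, family = "binomial")))
sort(uni_aic)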
Now we have a set of predictors, and we can fit a model with all of them. Because many niches are hump-shaped, we also include quadratic terms for all variables in the full model. Later, we will try dropping variables to find what is known as a minimal adequate model.
full_mod = glm(presence ~ bio1 + bio2 + bio4 + bio12 + bio15 +
I(bio1^2) + I(bio2^2) + I(bio4^2) + I(bio12^2) + I(bio15^2),
data = sorex_pa_sc, family='binomial')
summary(full_mod)
##
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio4 + bio12 + bio15 +
## I(bio1^2) + I(bio2^2) + I(bio4^2) + I(bio12^2) + I(bio15^2),
## family = "binomial", data = sorex_pa_sc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76762 0.29108 -6.073 1.26e-09 ***
## bio1 -2.22570 0.35337 -6.298 3.01e-10 ***
## bio2 1.08655 0.27159 4.001 6.32e-05 ***
## bio4 0.05688 0.44630 0.127 0.89858
## bio12 2.45852 0.37232 6.603 4.02e-11 ***
## bio15 0.26897 0.22309 1.206 0.22794
## I(bio1^2) -0.19156 0.19782 -0.968 0.33287
## I(bio2^2) -0.41563 0.19668 -2.113 0.03458 *
## I(bio4^2) -1.44826 0.51663 -2.803 0.00506 **
## I(bio12^2) -0.41557 0.19296 -2.154 0.03127 *
## I(bio15^2) -0.20846 0.19773 -1.054 0.29175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1296.57 on 1277 degrees of freedom
## Residual deviance: 378.49 on 1267 degrees of freedom
## AIC: 400.49
##
## Number of Fisher Scoring iterations: 9
We can see that many variables have significant relationships, but many do not. Moreover, models with many variables are often quite flexible, and we do not want to overfit. We can use the step function to iteratively reduce our model, using AIC as the criterion for selection (i.e., if dropping a term reduces the AIC, then the term will be dropped).
reduced_mod <- step(full_mod, trace=0)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(reduced_mod)
##
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio12 + I(bio2^2) + I(bio4^2) +
## I(bio12^2) + I(bio15^2), family = "binomial", data = sorex_pa_sc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6831 0.2703 -6.227 4.75e-10 ***
## bio1 -2.0368 0.2619 -7.778 7.34e-15 ***
## bio2 1.2526 0.2475 5.061 4.17e-07 ***
## bio12 2.3649 0.3497 6.763 1.35e-11 ***
## I(bio2^2) -0.5164 0.1809 -2.854 0.00432 **
## I(bio4^2) -1.5658 0.5295 -2.957 0.00311 **
## I(bio12^2) -0.3822 0.1956 -1.954 0.05072 .
## I(bio15^2) -0.3388 0.1773 -1.910 0.05608 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1296.57 on 1277 degrees of freedom
## Residual deviance: 381.02 on 1270 degrees of freedom
## AIC: 397.02
##
## Number of Fisher Scoring iterations: 9
However, step is not very smart about dropping terms. In particular, it will happily eliminate a linear term while leaving the corresponding quadratic term, which is bad practice. So we must either drop the quadratic terms for bio4 and bio15, or add the linear terms back in. We will use the AIC difference to decide which is better; a positive number indicates an improvement.
## adding bio4 back in
AIC(reduced_mod) - AIC(update(reduced_mod, ~.+bio4))
## [1] -1.661996
## dropping the quadratic term
AIC(reduced_mod) - AIC(update(reduced_mod, ~.-I(bio4^2)))
## [1] -19.99535
reduced_mod = update(reduced_mod, ~.+bio4)
For bio4, adding the linear term makes the model slightly worse, but dropping the quadratic term is much worse, so we add the linear term back.
## adding bio15 back in
AIC(reduced_mod) - AIC(update(reduced_mod, ~.+bio15))
## [1] -0.7539151
## dropping the quadratic term
AIC(reduced_mod) - AIC(update(reduced_mod, ~.-I(bio15^2)))
## [1] -0.7874426
reduced_mod = update(reduced_mod, ~.-I(bio15^2))
We see with bio15 that we could either add the linear term or drop the quadratic one; the effect is very similar. Removing the quadratic term is slightly worse, but it also keeps the model simpler, so we do this.
Now our model looks like this. Note that this model is quite large, and contains curvilinear terms; thus it is not so easy to understand just from the parameters how the range of our shrew relates to each predictor.
summary(reduced_mod)
##
## Call:
## glm(formula = presence ~ bio1 + bio2 + bio12 + I(bio2^2) + I(bio4^2) +
## I(bio12^2) + bio4, family = "binomial", data = sorex_pa_sc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9799 0.2538 -7.802 6.11e-15 ***
## bio1 -2.1499 0.2572 -8.358 < 2e-16 ***
## bio2 1.2166 0.2658 4.578 4.70e-06 ***
## bio12 2.4436 0.3722 6.565 5.22e-11 ***
## I(bio2^2) -0.5357 0.1821 -2.942 0.00326 **
## I(bio4^2) -1.7515 0.5491 -3.190 0.00142 **
## I(bio12^2) -0.3631 0.1968 -1.845 0.06507 .
## bio4 0.4993 0.4222 1.183 0.23698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1296.57 on 1277 degrees of freedom
## Residual deviance: 383.47 on 1270 degrees of freedom
## AIC: 399.47
##
## Number of Fisher Scoring iterations: 9
Model assessment is a rather large topic, so I only present some selected topics here. In particular, evaluating model performance is key, but we will do very little of it in this exercise. Rather, see Allouche et al. (2006) and Araujo et al. (2005).
Previously, we fit a response curve to a single variable, showing how the probability of presence changed as the temperature changed. In the lecture, we have also seen response surfaces with two variables. With more than two variables, visualisation becomes difficult. More commonly, we plot partial response curves; these show how the probability of presence (or suitability, for presence-only models) changes as a single variable changes, while holding all other variables constant at their means.
The binomial GLM models the probability of observing the outcome (in this case, that the species is present) as a function of the predictor variables. However, when we lack true observations of absence, these probabilities cannot be reliably calibrated. Therefore, the predictions of binomial SDMs can only properly be referred to as probabilities if we have true absences, if there is a well-designed sampling protocol with good environmental and geographic coverage, if the predictor variables are restricted to an appropriate spatial domain, and if the spatial scales of sampling and prediction are comparable and appropriate. In all other cases, model predictions are better referred to as suitabilities rather than probabilities.
## the variables in our model
vars = c("bio1", "bio2", "bio4", "bio12")
nms = c(bio1 = "Annual Mean Temperature", bio2 = "Mean Diurnal Range", bio4 = "Temperature Seasonality",
bio12 = "Annual Precipitation")
## create a dummy data frame for predicting
pr_data = data.frame(bio1 = rep(0, 100), bio2=0, bio4=0, bio12=0)
par(mfrow=c(2,2))
for(v in vars) {
pr_data[[v]] = seq(-3,3, length.out=nrow(pr_data))
theta = predict(reduced_mod, newdata=pr_data, type='response')
x = pr_data[[v]] * scaling$scale[v] + scaling$center[v]
plot(x, theta, type='l', xlab=nms[v], ylab = "suitability")
pr_data[[v]] = 0
}
These figures give us a picture of the shrew that makes some sense: our mammal likes environments that are cold, have large swings in the daily temperature and are highly seasonal in terms of temperature, and have high and predictable precipitation.
Now we want to know how well our model predicts the observations. Different measures are available for quantifying this, and many of them are threshold-dependent. You have probably realised that our model predicts a continuous response (the probability of occurrence, or suitability), while our observations are binary. Many performance measures rely on comparisons like "how many presence observations does the model correctly predict as presences?" To answer this, we first need to convert the continuous probabilities into binary predictions. Different thresholds are introduced in Liu et al. (2005); most of them are implemented in the optimal.thresholds function of the PresenceAbsence package. Here, we will use the threshold that maximises the sum of sensitivity and specificity:
## prepare the data
thresh_data = data.frame(
id = 1:nrow(sorex_pa_sc),
pres = as.integer(sorex_pa_sc[['presence']])-1,
prob = predict(reduced_mod, type='response')
)
thold = optimal.thresholds(thresh_data, opt.methods=3)[1,2]
## print a confusion matrix using our threshold
cmx(thresh_data, thold)
## observed
## predicted 1 0
## 1 242 43
## 0 20 973
From here, we can compute sensitivity, specificity, and other measures such as the True Skill Statistic (TSS; Allouche et al. 2006) to evaluate our model; a small sketch follows below. The threshold will also be needed for computing range size.
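For example, here is a minimal sketch using the sensitivity and specificity helpers from PresenceAbsence, which accept the confusion matrix produced by cmx (I assume st.dev = FALSE returns the plain estimates):

## sensitivity = proportion of observed presences predicted present;
## specificity = proportion of observed absences predicted absent;
## TSS = sensitivity + specificity - 1 (Allouche et al. 2006)
cm = cmx(thresh_data, thold)
sens = sensitivity(cm, st.dev = FALSE)
spec = specificity(cm, st.dev = FALSE)
c(sensitivity = sens, specificity = spec, TSS = sens + spec - 1)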
Our last step is to make some maps. Recall that we have a climate dataset, clim_pres. We can also load in the future climate.
clim_fut = rast("../data/bioclim_2060.grd")
In order to make maps, we need to take a few steps.
# 1. Convert the rasters to data frames (one row per cell)
clim_pres_df = as.data.frame(clim_pres, xy = TRUE)
clim_fut_df = as.data.frame(clim_fut, xy = TRUE)
# 2. Rescale
for(v in names(clim_pres)) {
clim_pres_df[[v]] = (clim_pres_df[[v]] - scaling$center[v]) / scaling$scale[v]
clim_fut_df[[v]] = (clim_fut_df[[v]] - scaling$center[v]) / scaling$scale[v]
}
# 3. Predict
clim_pres_df$suitability = predict(reduced_mod, newdata = clim_pres_df, type='response')
clim_fut_df$suitability = predict(reduced_mod, newdata = clim_fut_df, type='response')
clim_pres_df$presence = ifelse(clim_pres_df$suitability > thold, 1, NA)
clim_fut_df$presence = ifelse(clim_fut_df$suitability > thold, 1, NA)
# 4. Convert back to raster
pr_suitability = rast(clim_pres_df[, c('x', 'y', 'suitability')],
type = 'xyz', crs = crs(clim_pres))
fut_suitability = rast(clim_fut_df[, c('x', 'y', 'suitability')],
type = 'xyz', crs = crs(clim_pres))
pr_presence = rast(clim_pres_df[, c('x', 'y', 'presence')],
type = 'xyz', crs = crs(clim_pres))
fut_presence = rast(clim_fut_df[, c('x', 'y', 'presence')],
type = 'xyz', crs = crs(clim_pres))
# 5. Plot
p1 = ggplot() + geom_spatraster(data = pr_suitability, aes(fill=suitability)) +
scale_fill_scico(palette = "lapaz", direction = 1, na.value = "transparent") +
labs(fill="Suitability") +
geom_sf(data = countries, colour='black', fill=NA)
## <SpatRaster> resampled to 501200 cells.
p2 = ggplot() + geom_spatraster(data =fut_suitability, aes(fill=suitability)) +
scale_fill_scico(palette = "lapaz", direction = 1, na.value = "transparent") +
labs(fill="Suitability") +
geom_sf(data = countries, colour='black', fill=NA)
## <SpatRaster> resampled to 501200 cells.
p3 = ggplot() + geom_sf(data = countries, colour=NA, fill="white") +
geom_spatraster(data = pr_presence, aes(fill=presence)) +
scale_fill_gradient(low = "#b41f1f", high = "#b41f1f", na.value = "transparent") +
ggtitle("Current Range") + guides(fill='none') +
geom_sf(data = countries, colour='black', fill=NA)
## <SpatRaster> resampled to 501200 cells.
p4 = ggplot() + geom_sf(data = countries, colour=NA, fill="white") +
geom_spatraster(data = fut_presence, aes(fill=presence)) +
scale_fill_gradient(low = "#b41f1f", high = "#b41f1f", na.value = "transparent") +
ggtitle("Current Range") + guides(fill='none') +
geom_sf(data = countries, colour='black', fill=NA)
## <SpatRaster> resampled to 501200 cells.
grid.arrange(p1, p2, p3, p4, nrow=2)
Note that our model includes many areas from which the shrew is likely absent (perhaps due to dispersal constraints, historical reasons, etc.). So our range maps here are perhaps better described as maps of potentially suitable habitat, rather than accurate maps of the current range. What would we/could we change about our approach to get a model of the actual shrew range?
We can now return to our previous function for computing range size to compare the present and future range sizes, using the presence rasters we created above.
#' Input: a raster containing a species range
#' NA cells are out of range, all others are in range
range_size = function(r, in_units = "m^2", out_units = "km^2") {
cs = cellSize(r)
mcs = mask(cs, r)
vals = values(mcs)
rs = units::set_units(sum(vals, na.rm = TRUE), in_units, mode = "standard")
if(in_units != out_units)
rs = units::set_units(rs, out_units, mode = "standard")
rs
}
cat("Present:", range_size(pr_presence), "km^2\n",
"Future:", range_size(fut_presence), "km^2\n")
## Present: 224860.4 km^2
## Future: 105230 km^2
We predict that by 2060, more than half of the suitable habitat will be gone.
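We can quantify that claim directly from the two numbers above:

## proportional loss of suitable area (values printed above)
1 - 105230 / 224860.4 ## roughly 0.53, i.e. a bit more than half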
Feel free to experiment with other model types. A suggestion for the presence-only version of the data would be to use the maxent function from the dismo package (see ?maxent). You could also explore a GAM from the gam package, or randomForest from the randomForest package.
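As a hedged sketch of one of these alternatives (assuming the randomForest package is installed; this is not part of the exercise and is untested here), a random forest on the same presence-absence data might look like:

library(randomForest)
## random forests cannot handle the sf geometry column, so drop it first
rf_data = st_drop_geometry(sorex_pa_sc)
## a factor response makes this a classification forest
rf = randomForest(presence ~ bio1 + bio2 + bio4 + bio12 + bio15,
                  data = rf_data, ntree = 500)
rf ## prints the out-of-bag confusion matrix
varImpPlot(rf) ## which predictors matter most?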
Allouche, Omri, Asaf Tsoar, and Ronen Kadmon. 2006. “Assessing the Accuracy of Species Distribution Models: Prevalence, Kappa and the True Skill Statistic (TSS).” Journal of Applied Ecology 43: 1223–32.
Araujo, Miguel B., Richard G. Pearson, Wilfried Thuiller, and Markus Erhard. 2005. “Validation of Species-Climate Impact Models Under Climate Change.” Global Change Biology 11: 1504–13.
Barbet-Massin, M., F. Jiguet, C. H. Albert, and W. Thuiller. 2012. “Selecting Pseudo-Absences for Species Distribution Models: How, Where and How Many?” Methods in Ecology and Evolution 3: 327–38. https://doi.org/10.1111/j.2041-210X.2011.00172.x
Dormann, C. F., J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carre, J. R. Garcia Marquez, et al. 2013. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography 36: 27–46.
Elith, J., and J. R. Leathwick. 2009. “Species Distribution Models: Ecological Explanation and Prediction Across Space and Time.” Annual Review of Ecology, Evolution, and Systematics 40: 677–97.
Franklin, J. 2010. Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University Press.
Liu, C., P. M. Berry, T. P. Dawson, and R. G. Pearson. 2005. “Selecting Thresholds of Occurrence in the Prediction of Species Distributions.” Ecography 28: 385–93.
Modified with thanks from Damaris Zurell, CC-BY 4.0