Species Distribution Models: Data Preparation

There are a number of large public databases that can be useful for macroecological analyses. We will begin with a few that will be useful for species distribution modelling. But before we start, remember to install any packages you don’t have on your computer with install.packages("package_name").

1. Point occurrence records

We will use the Global Biodiversity Information Facility (GBIF) to access species occurrence records. This is an open database of worldwide occurrences of species from all taxonomic groups. We can use the rgbif package to interact with the database directly from R.

# here is a list of packages you need for this exercise
library(rgbif)
library(rnaturalearth)
library(sf)
library(CoordinateCleaner)
library(ggplot2)
library(terra)
library(geodata)

## How many occurrences recorded in Austria?
occ_count(country="AT")

## [1] 13035179

## you can get the ISO codes for other countries with this variable:
## isocodes

## GBIF has many kinds of occurrence records. Here we ask for species that were observed (i.e., no specimen was taken)
occ_count(country="AT", basisOfRecord = "OBSERVATION")

## [1] 1540

1.1 Downloading records

We will use Sorex alpinus, the alpine shrew, as an example taxon for this exercise. The alpine shrew can be uncommonly found in Tirol, and is listed as near threatened by the IUCN.

Before looking up a species in GBIF, it’s a good idea to check for synonyms:

(tkey <- name_suggest(q='Sorex alpinus', rank='species'))

## Records returned [1] 
## No. unique hierarchies [0] 
## Args [q=Sorex alpinus, limit=100, rank=species, fields1=key,
##      fields2=canonicalName, fields3=rank] 
## # A tibble: 1 × 3
##       key canonicalName rank   
##     <int> <chr>         <chr>  
## 1 2435986 Sorex alpinus SPECIES

This indicates only one known name, so we proceed with a record search.

occ_count(taxonKey = tkey$data$key)

## [1] 2036

shrew = occ_search(taxonKey = tkey$data$key, limit=1200)$data

## remove NA coordinates
shrew = subset(shrew, !is.na(decimalLatitude))

## this is a very large table, so we just check a piece of it
head(shrew[,1:10])

## # A tibble: 6 × 10
##   key        scientificName   decimalLatitude decimalLongitude issues datasetKey
##   <chr>      <chr>                      <dbl>            <dbl> <chr>  <chr>     
## 1 4880822152 Sorex alpinus S…            47.8             13.2 ""     8a863029-…
## 2 4920858783 Sorex alpinus S…            47.8             13.2 "cdc"  c1492854-…
## 3 4851859068 Sorex alpinus S…            46.4             12.2 "cdc,… 50c9509d-…
## 4 4901479303 Sorex alpinus S…            47.4             11.0 "cdc,… 50c9509d-…
## 5 4901510667 Sorex alpinus S…            47.4             10.9 "cdc,… 50c9509d-…
## 6 4881209229 Sorex alpinus S…            47.8             13.2 ""     8a863029-…
## # ℹ 4 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   hostingOrganizationKey <chr>, publishingCountry <chr>

1.2 Data cleaning

GBIF is a public database, so it’s not uncommon to find incorrect records. It’s important therefore to start by checking for and removing any obvious problems. One of the easiest ways is to plot the data on a map. Later, we will also download an expert-drawn range map and compare it to the occurrences.

countries = ne_countries(scale="medium", returnclass = "sf")
shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)

ggplot(countries) + 
    geom_sf(colour='black', fill='white') + xlim(c(5,130)) + ylim(c(20,60)) + 
    geom_sf(data = shrew_sf, col='red')

One record looks suspicious; our shrew probably does not occur in Northeast China. But there are many other ways that GBIF records can be less precise than we want. For example, the location precision field tells us how well the location is known for each record; we may want to remove records that are not very precise.

## get a histogram of the range of location precision, in km, and tally the number of NAs
hist(shrew$coordinateUncertaintyInMeters/1000, main = "", xlab="Location uncertainty (km)")
legend("topright", legend = paste("NA Count:", sum(is.na(shrew$coordinateUncertaintyInMeters))))

Some have uncertainties of thousands of km’s! Lets view the histogram for only relatively precise locations.

## get a histogram of the range of location precision, in km, and tally the number of NAs
hist(subset(shrew, coordinateUncertaintyInMeters <= 50000)$coordinateUncertaintyInMeters/1000, 
     main = "", xlab="Location uncertainty (km)", breaks=20)
legend("topright", legend = paste("NA Count:", sum(is.na(shrew$coordinateUncertaintyInMeters))))

Let’s drop all records with uncertainties greater than 10 km. We will keep the NAs, because we aren’t sure what this means, but dropping them is also a rational choice. Additionally, there are many old records. Because we will use contemporary climate to model the species’ range, we should remove old records to avoid bias in areas with rapidly changing climate.

shrew = subset(shrew, is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 10*1000)
shrew = subset(shrew, year > 1970)

There are many other manual checks we could do, and we could try to delete points by hand this way. But a better approach is to use workflows created by experts. So we will use the function clean_coordinates() from the CoordinateCleaner package. We run a number of checks here. Take a moment to think about th e logic of these checks. Other checks can be viewed in the help using ?clean_coordinates.

capitals: Are coordinates within 5000 m of national capitals?
centroids: Within 1000 m of country centroids?
institutions: Within 100 m of a known biodiversity institute?
outliers: Looks for extreme geographic outliers

shrew_clean = clean_coordinates(shrew, lon="decimalLongitude", lat="decimalLatitude", capitals_rad = 5000,
                                countries="countryCode", tests=c("capitals", "centroids", "institutions", "outliers"))

## Testing coordinate validity

## Flagged 0 records.

## Testing country capitals

## Flagged 2 records.

## Testing country centroids

## Flagged 0 records.

## Testing geographic outliers

## Flagged 12 records.

## Testing biodiversity institutions

## Flagged 0 records.

## Flagged 14 of 1031 records, EQ = 0.01.

shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
shrew_sf$status = ifelse(shrew_clean$.summary, "ok", "flagged")
wmap = ggplot(countries) + 
    geom_sf(colour='black', fill='white') + xlim(c(5,130)) + ylim(c(20,60))  + xlab("") + ylab("")

wmap + geom_sf(data = shrew_sf, aes(colour = status))

The outlier in China is flagged, but all other records seem ok, so we remove the outlier and proceed.

shrew = shrew[shrew_clean$.summary,]
shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)

wmap = ggplot(countries) + 
    geom_sf(colour='black', fill='white') + xlim(c(5,27)) + ylim(c(40,55))  + xlab("") + ylab("")
wmap + geom_sf(data = shrew_sf, colour="#00BFC4", size=0.7)

The help files and vignettes for CoordinateCleaner detail other issues that can be found using GBIF data, and are worth a read, as is the methods paper where they describe in detail how to go about cleaning the data (see citation('CoordinateCleaner')).

2. Range Maps

Point data are helpful for describing specific occurrences of a species, but they often come with biases (e.g., there may be many more records in easily accessible places). Thus, we often also need to rely on maps drawn by experts. The IUCN makes such maps available for free (https://www.iucnredlist.org/resources/spatial-data-download), and there are many other such sources online. Access to data is provided for approved reasons; our data is available for you from the instructor. I have saved it in the “data” folder inside my working directory.

shrew_range = readRDS("../data/sorex.rds")
wmap_r = wmap + geom_sf(data = shrew_range, colour = "#aa1308", fill="#F8766D", alpha=0.5)
(wmap_rp <- wmap_r + geom_sf(data = shrew_sf, colour="#00BFC4", size=0.7))

For the current exercise, we are mostly interested in viewing the locations with respect to the expert-derived range maps. However, we often want to do some analyses with these programs. Here we can practise some code that we will use later on.

2.1 Producing raster maps

We will use the terra package to produce a raster from our range map.

First we must create a template raster that tells us at what resolution we want our raster map to be in. The problem is that it is unclear at which spatial resolution the range maps accurately represent species occurrences. Hurlbert and Jetz (2007) and Jetz, et al (2012) define the minimum spatial resolution as 100-200km (1-2°), although also resolutions of 50km (0.5°) and finer have been used (Krosby et al. 2015). We will use a (relatively fine) resolution of 10 arcminutes to allow us to intersect the raster range map with climate data.

## Create a template raster
template = rast(ext(shrew_range), resolution = 1/6, crs = crs(shrew_sf))
shrew_range_r = rasterize(shrew_range, template)

## convert to dataframe for plotting
wmap_r + geom_tile(data = as.data.frame(shrew_range_r, xy = TRUE), 
              aes(x=x, y=y), fill = "#619CFF", alpha = 0.5, 
              show.legend = FALSE)

2.1 Range size

Many analyses deal with range sizes. We might naively compute the sum of all of the cells in our raster; however this presents a problem. Why is this? Think about the projection of the data:

crs(shrew_range_r)

## [1] "GEOGCRS[\"WGS 84\",\n    ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n        MEMBER[\"World Geodetic System 1984 (Transit)\"],\n        MEMBER[\"World Geodetic System 1984 (G730)\"],\n        MEMBER[\"World Geodetic System 1984 (G873)\"],\n        MEMBER[\"World Geodetic System 1984 (G1150)\"],\n        MEMBER[\"World Geodetic System 1984 (G1674)\"],\n        MEMBER[\"World Geodetic System 1984 (G1762)\"],\n        MEMBER[\"World Geodetic System 1984 (G2139)\"],\n        MEMBER[\"World Geodetic System 1984 (G2296)\"],\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]],\n        ENSEMBLEACCURACY[2.0]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    USAGE[\n        SCOPE[\"Horizontal component of 3D system.\"],\n        AREA[\"World.\"],\n        BBOX[-90,-180,90,180]],\n    ID[\"EPSG\",4326]]"

What is the area of one cell?

There are a number of easy functions we can use to compute the area correctly, either from the original polygons or from the raster. Here we compute it from both, converting from square meters to square kilometers in the process

## First with polygons
area_m2 = sum(st_area(shrew_range)) ## sum because there are multiple polygons
units::set_units(area_m2, "km^2")

## 421434.6 [km^2]

## next with raster
# the cell size itself is a raster; what happens when you plot it?
cell_size_r = cellSize(shrew_range_r)
# we only want to count cells that are in the range, so we mask NAs
cell_size_r = mask(cell_size_r, shrew_range_r)
# then we extract the values, sum them, and convert units
range_size = units::set_units(sum(values(cell_size_r), na.rm = TRUE), "m^2")
# convert to km^2
units::set_units(range_size, "km^2")

## 422759 [km^2]

Notice that the two results are very close, but not identical. Why is this?

2.2 Centroids

Centroids are useful for computing the “average” location of the species. Note that it is entirely possible for this to be located in inappropriate habitat or outside of the range! These can be useful, for example, if one wants to ask, “how far north will a species migrate in response to climate change?”

Note that centroids don’t make sense for latitude/longitude data; the length of one degree is not constant! So we must project our data; here we use the Lambert Equal Area Conic projection (epsg:3035), which is useful and recommended for spatial analyses in continental Europe.

shrew_proj = st_transform(shrew_range, crs=3035)

## Combine all polygons into one
shrew_proj = st_union(shrew_proj)

## This gives us a single point, which we convert back to latitude/longitude
ctr = st_transform(st_centroid(shrew_proj), crs=crs(shrew_range))
wmap_rp + geom_sf(data = ctr, size=4)

3 Climate Data

If we want to mdoel the range as a function of climate, we also need to aquire spatial climate data. There are a number of sources for this; we will use the tried-and-true WorldClim dataset, which we can access directly from the worldclim package. However, there are newer datasets available; in particular the Chelsa Climate Dataset is quite good and freely available. There are also R packages that can give you access to these data directly from R.

We will use the bioclim climate variables for our analysis; be sure to read over the link for information on what these variables represent and why they are useful in SDM applications. We get 19 (!) variables in bioclim. Note that the Chelsea dataset provides even more bioclim variables.

## Note that this can be slow, btu it should only run once per computer

## make sure the directory exists
path = "../data/clim"
dir.create(path, showWarnings = FALSE)

clim = worldclim_global(var = "bio", res = 10, path = path)
plot(clim)

We have already set the appropriate resolution using the res argument. However, the terra package provides a lot of functions that are useful for manipulating these data in case we need to change resolutions, extents, etc. It is useful to browse the help for these to know they are available:

?resample
?aggregate
?crop
?extend

3.1 Joining (intersecting) occurrence and climate data

We now have two sources of information on the distribution of the alpine shrew, the occurrences from GBIF (shrew_sf) and the range map in both polygon (shrew_range) or raster (shrew_range_r) format. We can use functions in the terra package to get the climate values for either all of the points, or all of the cells in the raster.

## first for the GBIF points
shrew_sf = cbind(shrew_sf, extract(clim, shrew_sf))
head(shrew_sf[, grep("bio", colnames(shrew_sf))])

## Simple feature collection with 6 features and 19 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 8.78214 ymin: 46.83075 xmax: 13.21479 ymax: 47.84
## Geodetic CRS:  WGS 84
##   wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3 wc2.1_10m_bio_4
## 1        6.505573        8.508604        33.53277        649.9034
## 2        6.505573        8.508604        33.53277        649.9034
## 3        8.149062        8.922417        32.77950        698.9550
## 4        8.149062        8.922417        32.77950        698.9550
## 5        4.665219        7.072521        31.67273        580.0095
## 6        1.040583        6.332000        29.90319        548.9826
##   wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
## 1        19.95200        -5.42200         25.3740       14.466125
## 2        19.95200        -5.42200         25.3740       14.466125
## 3        22.49275        -4.72675         27.2195       16.664541
## 4        22.49275        -4.72675         27.2195       16.664541
## 5        16.78725        -5.54275         22.3300       11.922167
## 6        12.59675        -8.57825         21.1750        7.937334
##   wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11 wc2.1_10m_bio_12
## 1      -0.3784584         14.46612       -1.3513334             1380
## 2      -0.3784584         14.46612       -1.3513334             1380
## 3       0.8570418         16.66454       -0.4938749             1223
## 4       0.8570418         16.66454       -0.4938749             1223
## 5      -1.5429583         11.92217       -1.9702084             1722
## 6      -4.9315000          8.01225       -4.9315000             1659
##   wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15 wc2.1_10m_bio_16
## 1              188               75         32.60979              522
## 2              188               75         32.60979              522
## 3              159               66         32.25150              460
## 4              159               66         32.25150              460
## 5              198              106         20.91898              564
## 6              170              101         15.71897              498
##   wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19                  geometry
## 1              255              522              261 POINT (13.21479 47.77657)
## 2              255              522              261       POINT (13.21 47.78)
## 3              222              460              222  POINT (13.16295 47.8373)
## 4              222              460              222       POINT (13.16 47.84)
## 5              334              564              343  POINT (9.24984 47.09423)
## 6              332              481              332  POINT (8.78214 46.83075)

## Here, for example, are the points, coloured by bio1, which is the mean annual temperature
wmap + geom_sf(data = shrew_range, colour = "#aa1308", fill="white", alpha=0.4) + 
    geom_sf(data = shrew_sf, aes(colour = wc2.1_10m_bio_1), size=0.7) + 
    labs(colour = "Mean Annual Temperature (°C)") + scale_color_viridis_c(option = "magma")

We can also intersect the rasters if we want climate variables for the range polygons. Here we intersect bio18, which is the precipitation during the warmest quarter. Why might this be a sensible variable for this organism?

# fails! we need to align the rasters
# range_18 = intersect(shrew_range_r, clim$wc2.1_10m_bio_18)

# first we must make sure that the bioclim variable we 
# want has the same extent as the raster for the range
bio18 = resample(clim$wc2.1_10m_bio_18, shrew_range_r)

# next we mask to remove points outside the range
range_18 = mask(bio18, shrew_range_r)

# finally convert to a dataframe for plotting
range_18_pts = as.data.frame(range_18, xy = TRUE)
colnames(range_18_pts) = c("x", "y", "bio18")
wmap + geom_tile(data = range_18_pts, aes(x=x, y=y, fill=bio18), alpha = 0.8) + scale_fill_viridis_c(direction=-1)

References

Hurlbert, Allen H., and Walter Jetz. 2007. “Species Richness, Hotspots, and the Scale Dependence of Range Maps in Ecology and Conservation.” PNAS 104: 13384–9.

Jetz, Walter, Jana M. McPherson, and Robert P. Guralnick. 2012. “Integrating Biodiversity Distribution Knowledge: Toward a Global Map of Life.” Trends in Ecology & Evolution 27: 151–59.

Krosby, Meade, Chad B. Wilsey, Jenny L. McGuire, Jennifer M. Duggan, Theresa M. Nogeire, Julie A. Heinrichs, Joshua J. Tewksbury, and Joshua J. Lawler. 2015. “Climate-Induced Range Overlap Among Closely Related Species.” Nature Climate

Modified with thanks from Damaris Zurell, CC-BY 4.0