There are a number of large public databases that can be useful for
macroecological analyses. We will begin with a few that are useful
for species distribution modelling. But before we start, remember to
install any packages you don’t have on your computer with
install.packages("package_name").
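If it helps, here is a small convenience sketch (an optional addition, not part of the original exercise) that installs only those packages from the list below that are missing from your library:
## install any packages used in this exercise that are not yet installed
pkgs = c("rgbif", "rnaturalearth", "sf", "CoordinateCleaner", "ggplot2", "terra", "geodata")
missing = pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)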
We will use the Global Biodiversity
Information Facility (GBIF) to access species occurrence records.
This is an open database of worldwide occurrences of species from all
taxonomic groups. We can use the rgbif
package to interact
with the database directly from R.
# here is a list of packages you need for this exercise
library(rgbif)
library(rnaturalearth)
library(sf)
library(CoordinateCleaner)
library(ggplot2)
library(terra)
library(geodata)
## How many occurrences recorded in Austria?
occ_count(country="AT")
## [1] 13035179
## you can get the ISO codes for other countries with this variable:
## isocodes
## GBIF has many kinds of occurrence records. Here we ask for species that were observed (i.e., no specimen was taken)
occ_count(country="AT", basisOfRecord = "OBSERVATION")
## [1] 1540
We will use Sorex alpinus, the alpine shrew, as an example taxon for this exercise. The alpine shrew occurs uncommonly in Tirol and is listed as near threatened by the IUCN.
Before looking up a species in GBIF, it’s a good idea to check for synonyms:
(tkey <- name_suggest(q='Sorex alpinus', rank='species'))
## Records returned [1]
## No. unique hierarchies [0]
## Args [q=Sorex alpinus, limit=100, rank=species, fields1=key,
## fields2=canonicalName, fields3=rank]
## # A tibble: 1 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 2435986 Sorex alpinus SPECIES
This indicates only one known name, so we proceed with a record search.
occ_count(taxonKey = tkey$data$key)
## [1] 2036
shrew = occ_search(taxonKey = tkey$data$key, limit=1200)$data
## remove NA coordinates
shrew = subset(shrew, !is.na(decimalLatitude))
## this is a very large table, so we just check a piece of it
head(shrew[,1:10])
## # A tibble: 6 × 10
## key scientificName decimalLatitude decimalLongitude issues datasetKey
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 4880822152 Sorex alpinus S… 47.8 13.2 "" 8a863029-…
## 2 4920858783 Sorex alpinus S… 47.8 13.2 "cdc" c1492854-…
## 3 4851859068 Sorex alpinus S… 46.4 12.2 "cdc,… 50c9509d-…
## 4 4901479303 Sorex alpinus S… 47.4 11.0 "cdc,… 50c9509d-…
## 5 4901510667 Sorex alpinus S… 47.4 10.9 "cdc,… 50c9509d-…
## 6 4881209229 Sorex alpinus S… 47.8 13.2 "" 8a863029-…
## # ℹ 4 more variables: publishingOrgKey <chr>, installationKey <chr>,
## # hostingOrganizationKey <chr>, publishingCountry <chr>
GBIF is a public database, so it’s not uncommon to find incorrect records. It is therefore important to start by checking for and removing any obvious problems. One of the easiest ways is to plot the data on a map. Later, we will also download an expert-drawn range map and compare it to the occurrences.
countries = ne_countries(scale="medium", returnclass = "sf")
shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
ggplot(countries) +
geom_sf(colour='black', fill='white') + xlim(c(5,130)) + ylim(c(20,60)) +
geom_sf(data = shrew_sf, col='red')
One record looks suspicious; our shrew probably does not occur in Northeast China. But there are many other ways that GBIF records can be less precise than we want. For example, the location precision field tells us how well the location is known for each record; we may want to remove records that are not very precise.
## get a histogram of the range of location precision, in km, and tally the number of NAs
hist(shrew$coordinateUncertaintyInMeters/1000, main = "", xlab="Location uncertainty (km)")
legend("topright", legend = paste("NA Count:", sum(is.na(shrew$coordinateUncertaintyInMeters))))
Some have uncertainties of thousands of km! Let’s view the histogram for only the relatively precise locations.
## histogram of location precision (in km) for relatively precise records (<= 50 km), plus the number of NAs
hist(subset(shrew, coordinateUncertaintyInMeters <= 50000)$coordinateUncertaintyInMeters/1000,
main = "", xlab="Location uncertainty (km)", breaks=20)
legend("topright", legend = paste("NA Count:", sum(is.na(shrew$coordinateUncertaintyInMeters))))
Let’s drop all records with uncertainties greater than 10 km. We will keep the NAs, because we aren’t sure what they mean, but dropping them would also be a rational choice. Additionally, there are many old records. Because we will use contemporary climate to model the species’ range, we should remove old records to avoid bias in areas with rapidly changing climate.
shrew = subset(shrew, is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 10*1000)
shrew = subset(shrew, year > 1970)
There are many other manual checks we could do, and we could try to
delete points by hand this way. But a better approach is to use
workflows created by experts. So we will use the function
clean_coordinates() from the CoordinateCleaner
package. We run a number of checks here. Take a moment to think
about the logic of these checks. Other checks can be viewed in the help
using ?clean_coordinates.
capitals: Are coordinates within 5000 m of national capitals?
centroids: Within 1000 m of country centroids?
institutions: Within 100 m of a known biodiversity institute?
outliers: Looks for extreme geographic outliers
shrew_clean = clean_coordinates(shrew, lon="decimalLongitude", lat="decimalLatitude", capitals_rad = 5000,
    countries="countryCode", tests=c("capitals", "centroids", "institutions", "outliers"))
## Testing coordinate validity
## Flagged 0 records.
## Testing country capitals
## Flagged 2 records.
## Testing country centroids
## Flagged 0 records.
## Testing geographic outliers
## Flagged 12 records.
## Testing biodiversity institutions
## Flagged 0 records.
## Flagged 14 of 1031 records, EQ = 0.01.
shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
shrew_sf$status = ifelse(shrew_clean$.summary, "ok", "flagged")
wmap = ggplot(countries) +
geom_sf(colour='black', fill='white') + xlim(c(5,130)) + ylim(c(20,60)) + xlab("") + ylab("")
wmap + geom_sf(data = shrew_sf, aes(colour = status))
The outlier in China is flagged, but all other records seem ok, so we remove the outlier and proceed.
shrew = shrew[shrew_clean$.summary,]
shrew_sf = st_as_sf(shrew, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
wmap = ggplot(countries) +
geom_sf(colour='black', fill='white') + xlim(c(5,27)) + ylim(c(40,55)) + xlab("") + ylab("")
wmap + geom_sf(data = shrew_sf, colour="#00BFC4", size=0.7)
The help files and vignettes for CoordinateCleaner detail other
issues that can be found using GBIF data, and are worth a read, as is
the methods paper where they describe in detail how to go about cleaning
the data (see citation('CoordinateCleaner')).
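If you want to pull these up directly from R, a short optional snippet:
## list the package vignettes and print the citation for the methods paper
browseVignettes("CoordinateCleaner")
citation("CoordinateCleaner")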
Point data are helpful for describing specific occurrences of a species, but they often come with biases (e.g., there may be many more records in easily accessible places). Thus, we often also need to rely on maps drawn by experts. The IUCN makes such maps available for free (https://www.iucnredlist.org/resources/spatial-data-download), and there are many other such sources online. Access is granted for approved uses; the data for this exercise are available from the instructor. I have saved them in the “data” folder inside my working directory.
shrew_range = readRDS("../data/sorex.rds")
wmap_r = wmap + geom_sf(data = shrew_range, colour = "#aa1308", fill="#F8766D", alpha=0.5)
(wmap_rp <- wmap_r + geom_sf(data = shrew_sf, colour="#00BFC4", size=0.7))
For the current exercise, we are mostly interested in viewing the locations with respect to the expert-derived range maps. However, we often also want to do analyses with these data, so here we can practise some code that we will use later on. We will use the terra package to produce a raster from our range map.
First we must create a template raster that defines the resolution we want our raster map to have. The problem is that it is unclear at which spatial resolution the range maps accurately represent species occurrences. Hurlbert and Jetz (2007) and Jetz et al. (2012) suggest a minimum spatial resolution of 100-200 km (1-2°), although resolutions of 50 km (0.5°) and finer have also been used (Krosby et al. 2015). We will use a (relatively fine) resolution of 10 arcminutes to allow us to intersect the raster range map with climate data.
## Create a template raster
template = rast(ext(shrew_range), resolution = 1/6, crs = crs(shrew_sf))
shrew_range_r = rasterize(shrew_range, template)
## convert to dataframe for plotting
wmap_r + geom_tile(data = as.data.frame(shrew_range_r, xy = TRUE),
aes(x=x, y=y), fill = "#619CFF", alpha = 0.5,
show.legend = FALSE)
Many analyses deal with range sizes. We might naively compute the range size by summing all of the cells in our raster (each cell within the range has the value 1); however, this presents a problem. Why is this? Think about the projection of the data:
crs(shrew_range_r)
## [1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n MEMBER[\"World Geodetic System 1984 (G2296)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
What is the area of one cell?
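One quick way to explore this yourself (a small check, reusing the raster from above) is to compute the area of every cell and look at how much it varies:
## cells are 10 x 10 arcminutes everywhere, but their true area in km^2 depends on latitude
cell_areas = cellSize(shrew_range_r, unit = "km")
range(values(cell_areas))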
There are a number of easy functions we can use to compute the area correctly, either from the original polygons or from the raster. Here we compute it both ways, converting from square meters to square kilometers in the process.
## First with polygons
area_m2 = sum(st_area(shrew_range)) ## sum because there are multiple polygons
units::set_units(area_m2, "km^2")
## 421434.6 [km^2]
## next with raster
# the cell size itself is a raster; what happens when you plot it?
cell_size_r = cellSize(shrew_range_r)
# we only want to count cells that are in the range, so we mask NAs
cell_size_r = mask(cell_size_r, shrew_range_r)
# then we extract the values, sum them, and convert units
range_size = units::set_units(sum(values(cell_size_r), na.rm = TRUE), "m^2")
# convert to km^2
units::set_units(range_size, "km^2")
## 422759 [km^2]
Notice that the two results are very close, but not identical. Why is this?
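One way to probe this yourself (a sketch reusing the objects above) is to rasterise the polygons at a finer resolution and recompute the range size; the raster-based estimate should move closer to the polygon-based one as the cells shrink:
## rasterise at 1 arcminute instead of 10 and recompute the range size
template_fine = rast(ext(shrew_range), resolution = 1/60, crs = crs(shrew_sf))
shrew_range_fine = rasterize(shrew_range, template_fine)
area_fine = sum(values(mask(cellSize(shrew_range_fine), shrew_range_fine)), na.rm = TRUE)
units::set_units(units::set_units(area_fine, "m^2"), "km^2")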
Centroids are useful for computing the “average” location of the species. Note that it is entirely possible for this to be located in inappropriate habitat or outside of the range! These can be useful, for example, if one wants to ask, “how far north will a species migrate in response to climate change?”
Note that centroids don’t make sense for raw latitude/longitude data; the length of one degree is not constant! So we must project our data; here we use the Lambert Azimuthal Equal Area projection for Europe (EPSG:3035), which is recommended for spatial analyses across continental Europe.
shrew_proj = st_transform(shrew_range, crs=3035)
## Combine all polygons into one
shrew_proj = st_union(shrew_proj)
## Compute the centroid (a single point) and convert it back to latitude/longitude
ctr = st_transform(st_centroid(shrew_proj), crs=crs(shrew_range))
wmap_rp + geom_sf(data = ctr, size=4)
If we want to model the range as a function of climate, we also need
to acquire spatial climate data. There are a number of sources for this;
we will use the tried-and-true WorldClim dataset, which we can access
directly from the geodata package. However, there are
newer datasets available; in particular the CHELSA climate dataset is quite
good and freely available. There are also R packages that can give you
access to these data directly from R.
We will use the bioclim climate variables for our analysis; be sure to read the WorldClim documentation for information on what these variables represent and why they are useful in SDM applications. We get 19 (!) variables in bioclim. Note that the CHELSA dataset provides even more bioclim variables.
## Note that this can be slow, but it should only run once per computer
## make sure the directory exists
path = "../data/clim"
dir.create(path, showWarnings = FALSE)
clim = worldclim_global(var = "bio", res = 10, path = path)
plot(clim)
We have already set the appropriate resolution using the res
argument. However, the terra package
provides a lot of functions that are useful for manipulating these data
in case we need to change resolutions, extents, etc. It is useful to
browse the help for these to know they are available:
?resample
?aggregate
?crop
?extend
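As a small illustration (an assumed example, not part of the original workflow), here is how aggregate() and crop() might be combined to coarsen one bioclim layer to 0.5° and clip it to the area around the range map:
## aggregate bio1 from 10 arcminutes to 0.5 degrees (factor of 3), then crop to the range extent
bio1_coarse = aggregate(clim$wc2.1_10m_bio_1, fact = 3, fun = "mean")
bio1_alps = crop(bio1_coarse, ext(shrew_range))
plot(bio1_alps)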
We now have two sources of information on the distribution of the
alpine shrew: the occurrences from GBIF (shrew_sf) and the
range map in both polygon (shrew_range) and raster
(shrew_range_r) format. We can use functions in the terra
package to get the climate values for either all of the points, or all
of the cells in the raster.
## first for the GBIF points
shrew_sf = cbind(shrew_sf, extract(clim, shrew_sf))
head(shrew_sf[, grep("bio", colnames(shrew_sf))])
## Simple feature collection with 6 features and 19 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 8.78214 ymin: 46.83075 xmax: 13.21479 ymax: 47.84
## Geodetic CRS: WGS 84
## wc2.1_10m_bio_1 wc2.1_10m_bio_2 wc2.1_10m_bio_3 wc2.1_10m_bio_4
## 1 6.505573 8.508604 33.53277 649.9034
## 2 6.505573 8.508604 33.53277 649.9034
## 3 8.149062 8.922417 32.77950 698.9550
## 4 8.149062 8.922417 32.77950 698.9550
## 5 4.665219 7.072521 31.67273 580.0095
## 6 1.040583 6.332000 29.90319 548.9826
## wc2.1_10m_bio_5 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
## 1 19.95200 -5.42200 25.3740 14.466125
## 2 19.95200 -5.42200 25.3740 14.466125
## 3 22.49275 -4.72675 27.2195 16.664541
## 4 22.49275 -4.72675 27.2195 16.664541
## 5 16.78725 -5.54275 22.3300 11.922167
## 6 12.59675 -8.57825 21.1750 7.937334
## wc2.1_10m_bio_9 wc2.1_10m_bio_10 wc2.1_10m_bio_11 wc2.1_10m_bio_12
## 1 -0.3784584 14.46612 -1.3513334 1380
## 2 -0.3784584 14.46612 -1.3513334 1380
## 3 0.8570418 16.66454 -0.4938749 1223
## 4 0.8570418 16.66454 -0.4938749 1223
## 5 -1.5429583 11.92217 -1.9702084 1722
## 6 -4.9315000 8.01225 -4.9315000 1659
## wc2.1_10m_bio_13 wc2.1_10m_bio_14 wc2.1_10m_bio_15 wc2.1_10m_bio_16
## 1 188 75 32.60979 522
## 2 188 75 32.60979 522
## 3 159 66 32.25150 460
## 4 159 66 32.25150 460
## 5 198 106 20.91898 564
## 6 170 101 15.71897 498
## wc2.1_10m_bio_17 wc2.1_10m_bio_18 wc2.1_10m_bio_19 geometry
## 1 255 522 261 POINT (13.21479 47.77657)
## 2 255 522 261 POINT (13.21 47.78)
## 3 222 460 222 POINT (13.16295 47.8373)
## 4 222 460 222 POINT (13.16 47.84)
## 5 334 564 343 POINT (9.24984 47.09423)
## 6 332 481 332 POINT (8.78214 46.83075)
## Here, for example, are the points, coloured by bio1, which is the mean annual temperature
wmap + geom_sf(data = shrew_range, colour = "#aa1308", fill="white", alpha=0.4) +
geom_sf(data = shrew_sf, aes(colour = wc2.1_10m_bio_1), size=0.7) +
labs(colour = "Mean Annual Temperature (°C)") + scale_color_viridis_c(option = "magma")
We can also intersect the rasters if we want climate variables for
the range polygons. Here we intersect bio18, which is the
precipitation during the warmest quarter. Why might this be a sensible
variable for this organism?
# fails! we need to align the rasters
# range_18 = intersect(shrew_range_r, clim$wc2.1_10m_bio_18)
# first we must make sure that the bioclim variable we
# want has the same extent as the raster for the range
bio18 = resample(clim$wc2.1_10m_bio_18, shrew_range_r)
# next we mask to remove points outside the range
range_18 = mask(bio18, shrew_range_r)
# finally convert to a dataframe for plotting
range_18_pts = as.data.frame(range_18, xy = TRUE)
colnames(range_18_pts) = c("x", "y", "bio18")
wmap + geom_tile(data = range_18_pts, aes(x=x, y=y, fill=bio18), alpha = 0.8) + scale_fill_viridis_c(direction=-1)
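Once the layers are aligned and masked like this, summaries over the range are straightforward; for example (a simple one-liner), the mean warm-quarter precipitation across all range cells:
## mean of bio18 across the cells of the rasterised range
global(range_18, fun = "mean", na.rm = TRUE)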
Hurlbert, Allen H., and Walter Jetz. 2007. “Species Richness, Hotspots, and the Scale Dependence of Range Maps in Ecology and Conservation.” PNAS 104: 13384–9.
Jetz, Walter, Jana M. McPherson, and Robert P. Guralnick. 2012. “Integrating Biodiversity Distribution Knowledge: Toward a Global Map of Life.” Trends in Ecology & Evolution 27: 151–59.
Krosby, Meade, Chad B. Wilsey, Jenny L. McGuire, Jennifer M. Duggan, Theresa M. Nogeire, Julie A. Heinrichs, Joshua J. Tewksbury, and Joshua J. Lawler. 2015. “Climate-Induced Range Overlap Among Closely Related Species.” Nature Climate Change.
Modified with thanks from Damaris Zurell, CC-BY 4.0