The state of Tirol was divided into a 5-km grid, and the area of forest in each grid cell was computed using remote sensing.
We can observe that the distribution of forest cover is strongly non-normal!
# remember to set your working directory if needed!
# setwd("unit_1")
tirol_forest = read.csv("data/tirol_forest_cover.csv")
hist(tirol_forest$for_cover_km2, breaks=50, col="gray", xlab = expression(Forest~Cover~(km^2)), main = "")
Compute the population mean and standard deviation from the variable
x in t1_population, and store them in variables named mu
and sigma
. What are the values?
Hint: The sd
function computes the
sample standard deviation; you will need to compute this
manually using the formula:
\[ \sigma = \sqrt \frac{\sum \left(x - \mu \right )^2}{n} \] In code:
# population size
pop_n = nrow(tirol_forest)
# compute population mean
mu = mean(tirol_forest$for_cover_km2)
# squared deviations from mean, (x - mu)^2
sq_diff = (tirol_forest$for_cover_km2 - mu)^2
# the numerator above, the sum of squared deviations
numerator = sum(sq_diff)
# population standard deviation
sigma = sqrt(numerator/pop_n)
What is the theoretical standard error of the mean, assuming a sample size of 25? Use the \(\sigma\) you computed above.
The following code collects a sample of size n, from our population, then computes the sample mean and standard deviation.
n = 25
samp = sample(tirol_forest$for_cover_km2, n)
mean(samp)
sd(samp)
Run this code in your console. How do the mean and standard deviation you get compare to the known population mean and sd?
Now we will get everyone together to perform a simple experiment to estimate the mean of the population from a series of samples.
# get the standard error
samp_sd = sd(my_sample)
st_err = samp_sd/sqrt(n)
# compute quantiles from the t-distribution
# because this is a sample!
# why 0.975?? (hint: this is a 2-sided confidence interval)
quant = qt(0.975, df = n - 1)
# compute confidence limits
lower = samp_mean - quant * st_err
upper = samp_mean + quant * st_err
Next, load the shared data set from the whole course, and answer the following:
sd
function). How does that compare to the theoretical
standard error from 1a?We will use the dataset coral_fish.csv
for this
exercise. Load the file using the read.csv
command, and
save it in a variable named fish
.
It’s also a good idea to have a look at the data after you read it;
use the str
, head
, and View
functions. Note that you should only use View
directly in
Rstudio, you should not save it as part of a script.
# You may have to set or change your working directory, or change the file name to include the folder as well
fish = read.csv("data/coral_fish.csv")
str(fish)
## 'data.frame': 214 obs. of 2 variables:
## $ richness: int 81 95 103 91 101 96 51 95 105 103 ...
## $ type : chr "tropical" "tropical" "tropical" "tropical" ...
head(fish)
## richness type
## 1 81 tropical
## 2 95 tropical
## 3 103 tropical
## 4 91 tropical
## 5 101 tropical
## 6 96 tropical
The data contains two variables: richness
is the fish
species richness (i.e., the total number of species, a measure of \(\alpha\)-diversity) for 214 coral reefs in
the African biogeographical region. The other variable,
type
indicates whether the reef is tropical or
subtropical.
One of the most well-known global biogeographical patterns is known as the latitudinal diversity gradient; as one moves towards the tropics, biodiversity tends to increase. However, the pattern does not always hold for marine organisms. We will use the fish diversity data to test the hypothesis that reef fish diversity is higher in the tropics.
Make some figures exploring the data. Be sure that your figures are near publication quality; axes should be labelled with units (if appropriate), use of colour is encouraged but should help clarify, not confuse the figure, and extraneous elements should be minimised. Use this exercise to get comfortable with some of the plotting options in R.
# many of these options can be inserted in any plot call
# more options
# xlim = c(min, max) - set the axis lmits
# bty = 'n' - eliminate or change the box around the plot
# col = "green" - sets the primary color to green
# see ?plot, ?par, ?hist, ?boxplot for other options
hist(y, main = "Plot Title", xlab = "x-axis label", ylab = "y-axis label")
Make a boxplot for richness grouped by reef type. Remember the syntax:
boxplot(variable ~ group, data = fish)
Here, fish
is the name of the dataset,
group
is the name of the variable in fish
that
you should use to group the plots, and variable
is the
value you want plotted (e.g., richness
and
type
).
Do you see evidence that diversity differs from one reef type to the other? What about the distribution of the data? Do you think species richness is approximately normally distributed?
Compute summary statistics (at least: mean, sd, standard error) for both reef types. The fastest DRY way to do this is to use tapply:
with(fish, tapply(richness, type, mean))
## subtropical tropical
## 254.4211 427.7443
\[ C.I. = \bar{x} \pm t_{\alpha, n-1}\frac{s}{\sqrt{n}} \]
\(\alpha\) is the Type I error rate,
and \(n-1\) is the degrees of freedom.
You can get the \(t\) quantile for any
confidence interval using the qt()
function.
Here we demonstrate another useful feature of R: writing your own functions. Try to take some time to understand what is happening here.
conf_interval = function(x, alpha = 0.05) {
# the default is a 95% confidence interval, can be changed when you call the function
n = length(x)
xbar = mean(x)
sem = sd(x)/sqrt(n) # standard error
# half of alpha comes from each side of the distribution
t_vals = qt(c(alpha/2, 1-alpha/2), n -1)
# compute the CI
ci = xbar + sem * t_vals
names(ci) = c("lower", "upper")
return(ci)
}
# use tapply to get a confidence interval for both types
with(fish, tapply(richness, type, conf_interval))
## $subtropical
## lower upper
## 202.2071 306.6350
##
## $tropical
## lower upper
## 397.1979 458.2908
# change alpha to 0.01, getting a 99% CI
with(fish, tapply(richness, type, conf_interval, alpha = 0.01))
## $subtropical
## lower upper
## 184.4463 324.3958
##
## $tropical
## lower upper
## 387.4378 468.0509
2d. Perform a t-test of the hypothesis you wrote in question 1. Use
the R function t.test
, and be sure to check the help with
?t.test
before you start.
Pay attention to the options alternative
and
paired
. How should you set them for this particular
problem?
Report the results of the test: t-statistic, p-value, and conclusions.