Task 1: Central Limit Theorem

The state of Tirol was divided into a 5-km grid, and the area of forest in each grid cell was computed using remote sensing.

We can observe that the distribution of forest cover is strongly non-normal!

# remember to set your working directory if needed!
# setwd("unit_1")
tirol_forest = read.csv("data/tirol_forest_cover.csv")

hist(tirol_forest$for_cover_km2, breaks=50, col="gray", xlab = expression(Forest~Cover~(km^2)), main = "")

1a. Population statistics

Compute the population mean and standard deviation from the variable x in t1_population, and store them in variables named mu and sigma. What are the values?

Hint: The sd function computes the sample standard deviation; you will need to compute this manually using the formula:

\[ \sigma = \sqrt \frac{\sum \left(x - \mu \right )^2}{n} \] In code:

# population size
pop_n = nrow(tirol_forest)

# compute population mean
mu = mean(tirol_forest$for_cover_km2)

# squared deviations from mean, (x - mu)^2
sq_diff = (tirol_forest$for_cover_km2 - mu)^2

# the numerator above, the sum of squared deviations
numerator = sum(sq_diff)

# population standard deviation
sigma = sqrt(numerator/pop_n)

What is the theoretical standard error of the mean, assuming a sample size of 25? Use the \(\sigma\) you computed above.

1b. Taking a sample

The following code collects a sample of size n, from our population, then computes the sample mean and standard deviation.

n = 25
samp = sample(tirol_forest$for_cover_km2, n)
mean(samp)
sd(samp)

Run this code in your console. How do the mean and standard deviation you get compare to the known population mean and sd?

1c. Performing an ‘experiment’

Now we will get everyone together to perform a simple experiment to estimate the mean of the population from a series of samples.

Every student will take five samples (as in 1b). For each sample:
- Compute the sample mean \(\bar{x}\) and standard deviation \(s\)
- Use those values to compute a 95% confidence interval for each sample.
- Enter these in a google doc that we all share.

# get the standard error
samp_sd = sd(my_sample)
st_err = samp_sd/sqrt(n)

# compute quantiles from the t-distribution
# because this is a sample!
# why 0.975?? (hint: this is a 2-sided confidence interval)
quant = qt(0.975, df = n - 1)

# compute confidence limits
lower = samp_mean - quant * st_err 
upper = samp_mean + quant * st_err

Next, load the shared data set from the whole course, and answer the following:

Make a histogram of the sample means. How does the shape of this histogram compare to the histogram of the raw data? What about a histogram of sample standard deviations?
What is the mean of sample means? How does it compare to the population mean? What about the mean of sample sds and its comparison to the population sd?
Compute the standard deviation of the sample means (using the sd function). How does that compare to the theoretical standard error from 1a?
Compute how often (% of time) the population mean (from 1a) is between the lower and upper confidence limits.

Task 2: Coral reef fish diversity

We will use the dataset coral_fish.csv for this exercise. Load the file using the read.csv command, and save it in a variable named fish.

It’s also a good idea to have a look at the data after you read it; use the str, head, and View functions. Note that you should only use View directly in Rstudio, you should not save it as part of a script.

# You may have to set or change your working directory, or change the file name to include the folder as well
fish = read.csv("data/coral_fish.csv")

str(fish)

## 'data.frame':    214 obs. of  2 variables:
##  $ richness: int  81 95 103 91 101 96 51 95 105 103 ...
##  $ type    : chr  "tropical" "tropical" "tropical" "tropical" ...

head(fish)

##   richness     type
## 1       81 tropical
## 2       95 tropical
## 3      103 tropical
## 4       91 tropical
## 5      101 tropical
## 6       96 tropical

The data contains two variables: richness is the fish species richness (i.e., the total number of species, a measure of \(\alpha\)-diversity) for 214 coral reefs in the African biogeographical region. The other variable, type indicates whether the reef is tropical or subtropical.

One of the most well-known global biogeographical patterns is known as the latitudinal diversity gradient; as one moves towards the tropics, biodiversity tends to increase. However, the pattern does not always hold for marine organisms. We will use the fish diversity data to test the hypothesis that reef fish diversity is higher in the tropics.

2a. Exploratory figures

Make some figures exploring the data. Be sure that your figures are near publication quality; axes should be labelled with units (if appropriate), use of colour is encouraged but should help clarify, not confuse the figure, and extraneous elements should be minimised. Use this exercise to get comfortable with some of the plotting options in R.

# many of these options can be inserted in any plot call
# more options
# xlim = c(min, max) - set the axis lmits
# bty = 'n' - eliminate or change the box around the plot
# col = "green" - sets the primary color to green
# see ?plot, ?par, ?hist, ?boxplot for other options
hist(y, main = "Plot Title", xlab = "x-axis label", ylab = "y-axis label")

Make a boxplot for richness grouped by reef type. Remember the syntax:

boxplot(variable ~ group, data = fish)

Here, fish is the name of the dataset, group is the name of the variable in fish that you should use to group the plots, and variable is the value you want plotted (e.g., richness and type).

Do you see evidence that diversity differs from one reef type to the other? What about the distribution of the data? Do you think species richness is approximately normally distributed?

2b. Summary statistics

Compute summary statistics (at least: mean, sd, standard error) for both reef types. The fastest DRY way to do this is to use tapply:

with(fish, tapply(richness, type, mean))

## subtropical    tropical 
##    254.4211    427.7443

2c. Confidence intervals

Write out appropriate null and alternate hypotheses for this exercise. Is the alternate one- or two-sided?
Construct a 95% confidence interval for each mean.

\[ C.I. = \bar{x} \pm t_{\alpha, n-1}\frac{s}{\sqrt{n}} \]

\(\alpha\) is the Type I error rate, and \(n-1\) is the degrees of freedom. You can get the \(t\) quantile for any confidence interval using the qt() function.

Here we demonstrate another useful feature of R: writing your own functions. Try to take some time to understand what is happening here.

conf_interval = function(x, alpha = 0.05) {
    # the default is a 95% confidence interval, can be changed when you call the function
    n = length(x)
    xbar = mean(x)
    sem = sd(x)/sqrt(n) # standard error
    # half of alpha comes from each side of the distribution
    t_vals = qt(c(alpha/2, 1-alpha/2), n -1)
    # compute the CI
    ci = xbar + sem * t_vals
    names(ci) = c("lower", "upper")
    return(ci)
}

# use tapply to get a confidence interval for both types
with(fish, tapply(richness, type, conf_interval))

## $subtropical
##    lower    upper 
## 202.2071 306.6350 
## 
## $tropical
##    lower    upper 
## 397.1979 458.2908

# change alpha to 0.01, getting a 99% CI
with(fish, tapply(richness, type, conf_interval, alpha = 0.01))

## $subtropical
##    lower    upper 
## 184.4463 324.3958 
## 
## $tropical
##    lower    upper 
## 387.4378 468.0509

2d. Perform a t-test of the hypothesis you wrote in question 1. Use the R function t.test, and be sure to check the help with ?t.test before you start.

Pay attention to the options alternative and paired. How should you set them for this particular problem?

Report the results of the test: t-statistic, p-value, and conclusions.

Worksheet 1

VU Datenanalyse