More data manipulation

Here we have a sample of the abundances of five tree species in plots surveyed across a number of years.
Note that the format of the data makes some analyses easy:
- What is the average abundance for Abies balsamifera across all years?
But other things are harder:
- What is the average of all species combined for each year?
- What is the average for each species among years?

Today we will look at three useful functions for manipulating data frames in more complicated/interesting cases.

trees = read.csv("data/tree_abundance.csv")

head(trees)
##   pid year Abies.balsamifera Acer.saccharum Betula.papyrifera
## 1   1 1980                 4              0                 0
## 2   2 2006                40              0                 1
## 3   3 2006                36              0                 9
## 4   4 1980                 3              0                 0
## 5   5 1980                 4              0                 0
## 6   6 1980                16              0                 0
##   Populus.tremuloides Tsuga.canadensis
## 1                   0                0
## 2                   0                0
## 3                   0                0
## 4                   0                0
## 5                   0                0
## 6                   0                0
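The contrast between the easy and the harder questions can be sketched on a small stand-in table. The data frame toy and its columns sp_a and sp_b are hypothetical (not the course data), and "all species combined" is read here as the total per plot; only base R is used.

```r
# Hypothetical wide-format data: one column per species (not the real tree data)
toy = data.frame(pid = 1:4,
                 year = c(1980, 1980, 2006, 2006),
                 sp_a = c(4, 2, 40, 36),
                 sp_b = c(0, 2, 1, 9))

# Easy in wide format: the mean of one species across all years
mean_sp_a = mean(toy$sp_a)

# Harder: the mean of all species combined for each year.
# We first have to pick out the species columns by hand and add them up.
toy$total = rowSums(toy[, c("sp_a", "sp_b")])
mean_total_by_year = tapply(toy$total, toy$year, mean)

mean_sp_a
mean_total_by_year
```

The second question already needs the species columns listed by name, which is the kind of bookkeeping that a tall layout avoids.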

More data manipulation: tall vs. wide data

Recall that each column of our data frame should hold a single variable. This dataset violates this principle!
- One variable, abundance, is spread across five columns.
- Another variable, species, is not encoded in a column at all! Rather, the information in this variable is encoded in the column names.
We say that this dataset is in a wide format.

An add-on package, reshape2, can help us convert between wide and tall data frames with two functions:

From wide => tall: melt
From tall => wide: dcast

# install.packages("reshape2") # run this once, to install the package
library("reshape2")
trees_tall = melt(trees, id.vars = c("pid", "year"), 
                  variable.name = "species", 
                  value.name = "abundance")
head(trees_tall)
##   pid year           species abundance
## 1   1 1980 Abies.balsamifera         4
## 2   2 2006 Abies.balsamifera        40
## 3   3 2006 Abies.balsamifera        36
## 4   4 1980 Abies.balsamifera         3
## 5   5 1980 Abies.balsamifera         4
## 6   6 1980 Abies.balsamifera        16

More data manipulation: tall vs. wide data

Tall-format data are generally easier to work with (in the R world this is sometimes called “tidy” data).
Recommendation: keep your data tall, and convert to wide as needed using dcast.

More data manipulation: tall vs. wide data

If our data are tall, we can easily perform all kinds of operations.
Here we cast the data to be wide again, but this time spreading both the species and year variables across the columns and filling the missing combinations with zeros.

trees_wide = dcast(trees_tall, pid ~ species + year, 
                   value.var = "abundance", fill = 0)
trees_wide[1:10, 1:5]
##    pid Abies.balsamifera_1975 Abies.balsamifera_1980 Abies.balsamifera_1985
## 1    1                      0                      4                      0
## 2    2                      0                      0                      0
## 3    3                      0                      0                      0
## 4    4                      0                      3                      0
## 5    5                      0                      4                      0
## 6    6                      0                     16                      0
## 7    7                      0                      0                      0
## 8    8                      0                      4                      0
## 9    9                      0                     30                      0
## 10  10                      0                      0                      0
##    Abies.balsamifera_1988
## 1                       0
## 2                       0
## 3                       0
## 4                       0
## 5                       0
## 6                       0
## 7                       0
## 8                       0
## 9                       0
## 10                      0
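A round trip on a tiny table may make the two directions easier to follow. This is a sketch that assumes reshape2 is installed; the data frame toy_wide and its species columns are hypothetical.

```r
library(reshape2)  # assumes the reshape2 package is installed

# Hypothetical wide data: one column per species (not the real tree data)
toy_wide = data.frame(pid = 1:3,
                      year = c(1980, 1980, 2006),
                      sp_a = c(4, 3, 40),
                      sp_b = c(0, 1, 9))

# wide => tall: one row per plot, year and species
toy_tall = melt(toy_wide, id.vars = c("pid", "year"),
                variable.name = "species", value.name = "abundance")

# tall => wide: species go back into columns, one row per plot and year
toy_back = dcast(toy_tall, pid + year ~ species, value.var = "abundance")

nrow(toy_tall)   # 3 plots x 2 species = 6 rows
toy_back
```

Variables on the left of the dcast formula stay as row identifiers, and variables on the right are spread across the columns.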

Taking subsets

If our data are tall, we can easily perform all kinds of operations.
subset gives you a new data frame containing only the rows of the old one that satisfy a condition

balsam_fir_1980 = subset(trees_tall, 
                         species == "Abies.balsamifera" & 
                            year == 1980)
head(balsam_fir_1980)
##   pid year           species abundance
## 1   1 1980 Abies.balsamifera         4
## 4   4 1980 Abies.balsamifera         3
## 5   5 1980 Abies.balsamifera         4
## 6   6 1980 Abies.balsamifera        16
## 7   7 1980 Abies.balsamifera         0
## 8   8 1980 Abies.balsamifera         4
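For comparison, the same selection can be written with plain logical indexing. The sketch below uses a hypothetical tall data frame, toy_tall, rather than the tree data.

```r
# Hypothetical tall data (not the real tree data)
toy_tall = data.frame(pid = c(1, 2, 3, 1, 2, 3),
                      year = c(1980, 2006, 1980, 1980, 2006, 1980),
                      species = rep(c("sp_a", "sp_b"), each = 3),
                      abundance = c(4, 40, 3, 0, 1, 2))

# subset() evaluates the condition using the column names directly
sub1 = subset(toy_tall, species == "sp_a" & year == 1980)

# The same rows via logical indexing; note the trailing comma (all columns)
keep = toy_tall$species == "sp_a" & toy_tall$year == 1980
sub2 = toy_tall[keep, ]

all.equal(sub1, sub2)
```

One difference to keep in mind: subset drops rows where the condition evaluates to NA, whereas logical indexing returns a row of NAs for each of them.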

Boxplots on tall data

If our data are tall, we can easily perform all kinds of operations.
Boxplots can use a simple formula syntax

par(mar = c(8, 3, 0.1, 0.1)) # adjust the margins
# draw a boxplot of abundance, grouped by species and year
bpl = boxplot(abundance ~ species + year, 
            # outline = FALSE disables plotting outliers
              data = trees_tall, outline = FALSE, 
            # xlab = "" removes the x-axis label, xaxt = "n" suppresses the x-axis
              xlab = "", xaxt = "n")

# Here we draw a custom x-axis with labels rotated 90 degrees
axis(side = 1, at = 1:length(bpl$names), 
     labels = bpl$names, cex.axis = 0.6, las = 2)

Aggregation

We can use aggregate to compute all kinds of summaries
- A bit like a supercharged tapply

# compute the mean of abundance, grouped by species and year
head(aggregate(abundance ~ species + year, 
               data = trees_tall, FUN = mean))
##               species year abundance
## 1   Abies.balsamifera 1975      56.7
## 2      Acer.saccharum 1975       0.0
## 3   Betula.papyrifera 1975       0.0
## 4 Populus.tremuloides 1975      10.3
## 5    Tsuga.canadensis 1975       0.0
## 6   Abies.balsamifera 1980      21.8
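Changing the right-hand side of the formula changes the grouping, which is how the harder questions from the start of the lecture can be answered on tall data. A minimal sketch on a hypothetical data frame, toy_tall:

```r
# Hypothetical tall data (not the real tree data)
toy_tall = data.frame(year = c(1980, 1980, 2006, 2006, 1980, 1980, 2006, 2006),
                      species = rep(c("sp_a", "sp_b"), each = 4),
                      abundance = c(4, 2, 40, 36, 0, 2, 1, 9))

# Mean abundance for each species, pooling all years
by_species = aggregate(abundance ~ species, data = toy_tall, FUN = mean)

# Mean abundance for each year, pooling all species
by_year = aggregate(abundance ~ year, data = toy_tall, FUN = mean)

by_species
by_year
```

Any function that reduces a vector to a single value (for example median, sd or length) can be passed as FUN in the same way.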

Paired (x,y) datasets

We often measure multiple variables about the same experimental unit
For example: each penguin has bill length and depth, flipper length, and body mass
For the simplest case, we can consider how two of these variables, flipper_length_mm and body_mass_g, relate to one another

# load penguin data
data(penguins, package = "palmerpenguins")
# convert to a data frame
penguins = as.data.frame(penguins)
# remove NAs
penguins = penguins[complete.cases(penguins),]
head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
## 7  Adelie Torgersen           38.9          17.8               181        3625
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 5 female 2007
## 6   male 2007
## 7 female 2007

Scatterplots for paired data

We use scatterplots to examine paired variables like this
In R, this is done with the plot function.
Here we see there is strong dependence between the two variables

# pch: plotting character, the symbol to use for the dots
# bty: the type of box to draw around the plot, 'n' disables this
plot(flipper_length_mm ~ body_mass_g, data = penguins, pch = 16, bty = 'n',
     xlab = "Penguin Body Mass (g)", ylab = "Penguin Flipper Length (mm)")

If we use ggplot, we can also vary colours by categories (e.g., species)

library(ggplot2) # run install.packages("ggplot2") once if needed
ggplot(penguins) + 
    geom_point(aes(x = body_mass_g, y = flipper_length_mm, 
                   colour = species)) + 
    theme_minimal() + 
    xlab("Body Mass (g)") + ylab("Flipper Length (mm)")

Covariance

If two variables $x$ and $y$ are independent, then knowing $x$ gives us no additional information about $y$ (and vice versa).
In other words, the distribution of values of $y$ does not change with respect to $x$.

$\Pr(y \mid x) = \Pr(y)$

x = rnorm(1000)
y = rnorm(1000) # these variables are independent

plot(x, y, pch=16, bty='n', main = "Independent random variables")

Covariance

Two variables that covary do not have this independence property; the values in $x$ can be used to predict $y$ (with some error), and vice-versa.
The sample covariance ( ${cov}_{x y}$ ) looks similar to the equation for the sample variance, but relates to the amount of variation that is shared between $x$ and $y$

${cov}_{x y} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{n - 1}$

# in R (x1, y1, etc. are example vectors that are not shown here):
c(cov(x1, y1),
  cov(x2, y2),
  cov(x3, y3))
## [1]  0.80012 -0.41042 -0.00761
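To connect the formula to the built-in function, the sample covariance can be computed term by term and compared with cov. The simulated vectors and the seed below are arbitrary choices for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x = rnorm(100)
y = 0.5 * x + rnorm(100)         # y depends on x, so the two should covary

n = length(x)
# sum of the products of deviations from the means, divided by n - 1
cov_by_hand = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(cov_by_hand, cov(x, y))
```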

Correlation

As with the mean and standard deviation, we can rescale the covariance to make it easier to compare different datasets.
Scale-independent covariance is called correlation; for many datasets we use the Pearson correlation coefficient $r_{x y}$ (the sample estimate of the population correlation $ρ$).
It ranges from $- 1$ to $1$ . Zero means zero covariance; $- 1$ and $1$ indicate that $x$ and $y$ predict each other perfectly (with a straight line)!

$\begin{aligned} {cov}_{x y} & = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{n - 1} \\ r_{x y} & = \frac{{cov}_{x y}}{s_{x} s_{y}} \end{aligned}$

# in R:
c(cor(x1, y1),
  cor(x2, y2),
  cor(x3, y3))
## [1]  0.89570 -0.39786 -0.00738
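The rescaling step can be checked numerically: dividing the covariance by the two standard deviations should reproduce cor, and changing the units of one variable should change the covariance but not the correlation. The simulated data below are an arbitrary illustration.

```r
set.seed(2)                      # arbitrary seed, for reproducibility only
x = rnorm(100)
y = -2 * x + rnorm(100)

# correlation is the covariance divided by both standard deviations
r_by_hand = cov(x, y) / (sd(x) * sd(y))
all.equal(r_by_hand, cor(x, y))

# Changing the units of x rescales the covariance but not the correlation
x_scaled = 1000 * x
all.equal(cov(x_scaled, y), 1000 * cov(x, y))
all.equal(cor(x_scaled, y), cor(x, y))
```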

Correlation significance testing

$H_{0}$ : $ρ = 0$

$H_{A}$ : two-sided ( $ρ \neq 0$ ) or one-sided ( $ρ > 0$ or $ρ < 0$ )

$r$ has a standard error:

$s_{r} = \sqrt{\frac{1 - r^{2}}{n - 2}}$

We can then compute a $t$ -statistic:

$t = \frac{r}{s_{r}}$

The p-value is the probability, under $H_{0}$ , of a $t$ -statistic at least as extreme as the one observed; we get it from the CDF of the $t$ distribution with $n - 2$ degrees of freedom.

Correlation test in R

We can run this test on the penguin example

n = nrow(penguins)
(r = cor(penguins$body_mass_g, penguins$flipper_length_mm))
## [1] 0.873

(s_r = sqrt((1-r^2)/(n-2)))
## [1] 0.0268

(t_val = r/s_r)
## [1] 32.6

(2 * pt(t_val, n-2, lower.tail = FALSE)) # two-sided test
## [1] 3.13e-105

And equivalently, using a built-in function

with(penguins, 
     cor.test(body_mass_g, flipper_length_mm, alternative = "two.sided"))
## 
##  Pearson's product-moment correlation
## 
## data:  body_mass_g and flipper_length_mm
## t = 33, df = 331, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.845 0.896
## sample estimates:
##   cor 
## 0.873

Correlation test: assumptions

Data must be at least interval scale
- ordinal data: Spearman rank correlation (available in cor.test and cor)
- nominal data: Association test (prop.test) or $χ^{2}$ test (chisq.test)
Population is distributed bivariate normal (or $n$ is sufficiently large)

Correlation pitfalls

- The test is misleading if the relationship is nonlinear
- Heterogeneity of subgroups

Spearman correlation

Used when Pearson correlation assumptions are violated
- non-normal data
- non-linear (but monotonic) relationships
- ordinal data

cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1, df = 148, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2447  0.0734
## sample estimates:
##     cor 
## -0.0879

The math is simple: rank-transform x and y, then compute the Pearson correlation of the ranks.

cor.test(x, y, method = 'spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 8e+05, p-value = 9e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## -0.357
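The rank-transform description can be verified directly. In this sketch the simulated x and y are an arbitrary monotonic but non-linear example, not the data used in the output above.

```r
set.seed(3)                      # arbitrary seed, for reproducibility only
x = runif(50)
y = exp(5 * x) + rnorm(50)       # monotonic overall, but strongly non-linear

rho_builtin = cor(x, y, method = "spearman")
rho_by_hand = cor(rank(x), rank(y))   # Pearson correlation of the ranks

all.equal(rho_builtin, rho_by_hand)
```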

Tests of association: 2 × 1 Tables

Mendel’s peas

Take homozygous violet (dominant) and white (recessive) plants
$F_{1}$ plants are 100% violet
Theory says that $F_{2}$ plants should be 75% violet, 25% white

Observation:

colour	F2
violet	705
white	224

$H_{0}$ : Inheritance is Mendelian (violet:white = 3:1)

$H_{A}$ : Inheritance is not exactly Mendelian

# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
## 
##  1-sample proportions test with continuity correction
## 
## data:  counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
##  0.730 0.786
## sample estimates:
##     p 
## 0.759
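The same hypothesis can also be checked as a chi-squared goodness-of-fit calculation done by hand. This is a sketch of an alternative route, not the prop.test computation itself: it applies no continuity correction, so the statistic comes out slightly larger than the value reported above.

```r
observed = c(violet = 705, white = 224)   # F2 counts from the table above
expected = sum(observed) * c(0.75, 0.25)  # counts predicted by a 3:1 ratio

# Pearson chi-squared statistic, without continuity correction
x2 = sum((observed - expected)^2 / expected)
p_val = pchisq(x2, df = 1, lower.tail = FALSE)

# The same goodness-of-fit test via chisq.test()
fit = chisq.test(observed, p = c(0.75, 0.25))
c(by_hand = x2, builtin = unname(fit$statistic))
```

Either way the p-value is large, so the observed counts are consistent with the 3:1 ratio.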

Tests of association: n × m Tables

Nesting holes of black-backed woodpeckers.

woodpecker = read.csv("../datasets/woodpecker.csv")
head(woodpecker)
##   forest_type nest_tree
## 1      burned     birch
## 2      burned     birch
## 3      burned jack pine
## 4      burned     aspen
## 5      burned     birch
## 6      burned jack pine


table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

We want to test for an association between the two variables (forest type and nest tree)

$H_{0}$ : Nest tree is not associated with forest type

$H_{A}$ : Nest tree is associated with forest type

Chi-squared test

table(woodpecker)/rowSums(table(woodpecker))
##            nest_tree
## forest_type  aspen  birch jack pine
##    burned   0.2500 0.6667    0.0833
##    unburned 0.4143 0.2571    0.3286

We can use a $χ^{2}$ -test to test for this association

# for 2x2 tables with small sample sizes: Fisher's exact test
# fisher.test()
with(woodpecker, chisq.test(forest_type, nest_tree))
## 
##  Pearson's Chi-squared test
## 
## data:  forest_type and nest_tree
## X-squared = 14, df = 2, p-value = 0.001
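To see where the statistic comes from, the expected counts under independence can be built from the row and column totals. The sketch below re-enters the counts from the contingency table by hand instead of reading the CSV file.

```r
# Counts copied from the contingency table above
tab = matrix(c(6, 16, 2,
               29, 18, 23),
             nrow = 2, byrow = TRUE,
             dimnames = list(forest_type = c("burned", "unburned"),
                             nest_tree = c("aspen", "birch", "jack pine")))

# Expected counts if nest tree were independent of forest type:
# (row total * column total) / grand total
expected = outer(rowSums(tab), colSums(tab)) / sum(tab)

x2 = sum((tab - expected)^2 / expected)
df = (nrow(tab) - 1) * (ncol(tab) - 1)
p_val = pchisq(x2, df = df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p = p_val), 3)
```

Large differences between the observed and expected cells are what drive the statistic up.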

Visualisation: Categorical Data

You see pie charts a lot. When should you use one? NEVER

- One problem: your brain is bad at converting angles to numbers
- This will over-emphasize large values
- Comparisons among intermediate groups are challenging

Visualisation: Categorical Data

This is almost as bad, and sadly much more common

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23


woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type)) + theme_minimal()
woodp_plot = woodp_plot + geom_bar(width = 0.5)
woodp_plot

Stacked bars are “unfair”: it is easiest to compare the “rooted” class (unburned), whose bars all start at zero.

woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5, 
                    position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
    theme_minimal() + labs(fill = "Forest Type")
woodp_plot

Side-by-side bars allow us to compare all categories on equal footing.
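If ggplot2 is not available, a similar comparison can be sketched with base R's barplot, which accepts a matrix of counts. This is an alternative to the ggplot code above; the counts are re-entered by hand from the table.

```r
# Counts copied from the contingency table above
tab = matrix(c(6, 16, 2,
               29, 18, 23),
             nrow = 2, byrow = TRUE,
             dimnames = list(forest_type = c("burned", "unburned"),
                             nest_tree = c("aspen", "birch", "jack pine")))

par(mfrow = c(1, 2))  # two panels side by side
# stacked bars (the default when barplot is given a matrix)
barplot(tab, legend.text = TRUE, main = "Stacked")
# dodged bars: one bar per forest type within each nest tree
mids = barplot(tab, beside = TRUE, legend.text = TRUE, main = "Side by side")
```

With beside = TRUE, barplot returns the bar midpoints as a matrix with one row per forest type, which is useful for adding labels later.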

Visualisation: Ordinal Data

Scatterplots become less useful.

birddiv = read.csv("../datasets/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_point() + theme_minimal()
head(birddiv)
##   Grow.degd For.cover  NDVI bird_type richness forest_frag
## 1       330      99.9 60.38    forest        8           1
## 2       330       0.0 22.88    forest        1           0
## 3       330      38.3 11.86    forest        5           3
## 4       330      11.4 19.07    forest        7           7
## 5       330       0.0  2.12    forest        2           0
## 6       170     100.0 54.03    forest        7           1

Visualisation: Ordinal Data

Adding jitter can sometimes improve things

bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_jitter() + theme_minimal()

Visualisation: Ordinal Data

Another solution: ordered boxplots

bird_plot = ggplot(birddiv, aes(x=as.factor(forest_frag), 
                y = richness, fill = bird_type)) + 
                geom_boxplot() + theme_minimal()
bird_plot = bird_plot + xlab("Forest Fragmentation")
