Lauren Talluto
15.01.2024
Today we will look at three useful functions for manipulating data frames in more complicated/interesting cases.
abundance
, is spread across five
colums.species
is not encoded in a column at
all! Rather, the information in this variable is encoded in the
column namesabundance
, is spread across five
colums.species
is not encoded in a column at
all! Rather, the information in this variable is encoded in the
column namesreshape2
, can help us convert
between wide and tall data frames with
two functions:melt
dcast
# install.packages("reshape2") # run this once, to install the package
library("reshape2")
trees_tall = melt(trees, id.vars = c("pid", "year"),
variable.name = "species",
value.name = "abundance")
head(trees_tall)
## pid year species abundance
## 1 1 1980 Abies.balsamifera 4
## 2 2 2006 Abies.balsamifera 40
## 3 3 2006 Abies.balsamifera 36
## 4 4 1980 Abies.balsamifera 3
## 5 5 1980 Abies.balsamifera 4
## 6 6 1980 Abies.balsamifera 16
dcast
# install.packages("reshape2") # run this once, to install the package
library("reshape2")
trees_tall = melt(trees, id.vars = c("pid", "year"),
variable.name = "species",
value.name = "abundance")
head(trees_tall)
## pid year species abundance
## 1 1 1980 Abies.balsamifera 4
## 2 2 2006 Abies.balsamifera 40
## 3 3 2006 Abies.balsamifera 36
## 4 4 1980 Abies.balsamifera 3
## 5 5 1980 Abies.balsamifera 4
## 6 6 1980 Abies.balsamifera 16
species
and year
variables into
columns and filling in zerostrees_wide = dcast(trees_tall, pid ~ species + year,
value.var = "abundance", fill = 0)
trees_wide[1:10, 1:5]
## pid Abies.balsamifera_1975 Abies.balsamifera_1980 Abies.balsamifera_1985
## 1 1 0 4 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 3 0
## 5 5 0 4 0
## 6 6 0 16 0
## 7 7 0 0 0
## 8 8 0 4 0
## 9 9 0 30 0
## 10 10 0 0 0
## Abies.balsamifera_1988
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
subset
gives you a new data frame that is a subset of
the old onebalsam_fir_1980 = subset(trees_tall,
species == "Abies.balsamifera" &
year == 1980)
head(balsam_fir_1980)
## pid year species abundance
## 1 1 1980 Abies.balsamifera 4
## 4 4 1980 Abies.balsamifera 3
## 5 5 1980 Abies.balsamifera 4
## 6 6 1980 Abies.balsamifera 16
## 7 7 1980 Abies.balsamifera 0
## 8 8 1980 Abies.balsamifera 4
par(mar = c(8, 3, 0.1, 0.1)) # adjust the margins
# draw a boxplot of abundace, grouped by species and year
bpl = boxplot(abundance ~ species + year,
# outline = FALSE disables ploting outliers
data = trees_tall, outline = FALSE,
# xlab disables the x-axis label, xaxt = "n" disables the x-axis
xlab = "", xaxt = "n")
# Here we draw a custom x-axis with labels rotated 90 degrees
axis(side = 1, at = 1:length(bpl$names),
labels = bpl$names, cex.axis = 0.6, las = 2)
aggregate
to compute all kinds of summaries
tapply
# compute the mean of abundance, grouped by species and year
head(aggregate(abundance ~ species + year,
data = trees_tall, FUN = mean))
## species year abundance
## 1 Abies.balsamifera 1975 56.7
## 2 Acer.saccharum 1975 0.0
## 3 Betula.papyrifera 1975 0.0
## 4 Populus.tremuloides 1975 10.3
## 5 Tsuga.canadensis 1975 0.0
## 6 Abies.balsamifera 1980 21.8
flipper_length_mm
and body_mass_g
, relate to
one another# load penguin data
data(penguins, package = "palmerpenguins")
# convert to a data frame
penguins = as.data.frame(penguins)
# remove NAs
penguins = penguins[complete.cases(penguins),]
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 5 female 2007
## 6 male 2007
## 7 female 2007
plot
function.# pch: plotting character, the symbol to use for the dots
# bty: the type of box to draw around the plot, 'n' disables this
plot(flipper_length_mm ~ body_mass_g, data = penguins, pch = 16, bty = 'n',
xlab = "Penguin Body Mass (g)", ylab = "Penguin Flipper Length (mm)")
plot
function.ggplot
, we can also vary colours by
categories (e.g., species)ggplot(penguins) +
geom_point(aes(x = body_mass_g, y = flipper_length_mm,
colour = species)) +
theme_minimal() +
xlab("Body Mass (g)") + ylab("Flipper Length (mm)")
x = rnorm(1000)
y = rnorm(1000) # these variables are independent
plot(x, y, pch=16, bty='n', main = "Independent random variables")
The probability that
We can run this test on the penguin example
We can run this test on the penguin example
And equivalently, using a built-in function
with(penguins,
cor.test(body_mass_g, flipper_length_mm, alternative = "two.sided"))
##
## Pearson's product-moment correlation
##
## data: body_mass_g and flipper_length_mm
## t = 33, df = 331, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.845 0.896
## sample estimates:
## cor
## 0.873
cor.test
and cor
)prop.test
) or chisq.test
)Heterogeneity of subgroups
Observation:
colour | F2 |
---|---|
violet | 705 |
white | 224 |
Observation:
colour | F2 |
---|---|
violet | 705 |
white | 224 |
# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
## 0.730 0.786
## sample estimates:
## p
## 0.759
woodpecker = read.csv("../datasets/woodpecker.csv")
head(woodpecker)
## forest_type nest_tree
## 1 burned birch
## 2 burned birch
## 3 burned jack pine
## 4 burned aspen
## 5 burned birch
## 6 burned jack pine
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
We want to test for an association between the two variables (forest type and nest tree)
We can use a
We can use a
You see this a lot. When should you do it? NEVER
This is almost as bad, and sadly much more common
Barplots, or proportional bars for counts within categories
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type)) + theme_minimal()
woodp_plot = woodp_plot + geom_bar(width = 0.5)
woodp_plot
Stacked bars are “unfair” — easiest to compare the “rooted” class (unburned).
Barplots, or proportional bars for counts within categories
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5,
position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
theme_minimal() + labs(fill = "Forest Type")
woodp_plot
Side-by-side bars allow us to compare all categories on equal footing.
Scatterplots become less useful.
birddiv = read.csv("../datasets/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag,
y = richness, colour = bird_type)) +
geom_point() + theme_minimal()
head(birddiv)
## Grow.degd For.cover NDVI bird_type richness forest_frag
## 1 330 99.9 60.38 forest 8 1
## 2 330 0.0 22.88 forest 1 0
## 3 330 38.3 11.86 forest 5 3
## 4 330 11.4 19.07 forest 7 7
## 5 330 0.0 2.12 forest 2 0
## 6 170 100.0 54.03 forest 7 1
Adding jitter can sometimes improve things
Another solution: ordered boxplots
Space, Right Arrow or swipe left to move to next slide, click help below for more details