The Basics

Lauren Talluto

13.01.2025

Course Introduction

Course Introduction

Course structure

Website and course materials

Introduction to R

Helpful vocabulary

Any code that is given to you will assume your working directory is set to the unit you are working on. So, for today’s exercises, you should run the following in the console: setwd("unit_1").

Recommendations

Variables

Recommendations

# Comments in R start with the # symbol
# anything after # will not be executed
# Comments are a useful way to annotate your code so you know what is 
# happening

# Legal variable names
x = 1
y0 = 5
time_of_day = "20:15"
dayOfWeek <- "Monday"

# bad!
# d is the diversity in our site, in species
d = 8

# better!
site_diversity = 8


# errors
0y = 5
my name = "Lauren"
## Error: <text>:21:2: unexpected symbol
## 20: # errors
## 21: 0y
##      ^

Data types

Operators

We use operators to perform computations on variables and constants

# assignment
x = 5

# math
x + 2
## [1] 7
(3 + x) * 2
## [1] 16
3^2
## [1] 9
10 %% 3
## [1] 1

# comparison
x < 7
## [1] TRUE
x == 5.1
## [1] FALSE

Logic

# and: both must be true
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# or: only one needs to be true
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# not: returns the opposite
!FALSE
## [1] TRUE

Logic

# making decisions
x = 2^15 + 1
if(x > 100) {
    print("x is big")
} else {
    print("x is small")
}
## [1] "x is big"

if(x > 100 & x %% 2 == 0) {
    print("x is big and even")
} else {
    print("x is either small or an odd number")
}
## [1] "x is either small or an odd number"

Functions

Mathematical functions

Get help on a function with ? or help(), for example: ?log or help(log). If you don´t know a function´s name, you can search for a (likely/suspected) string in its name with ??.

x = 5

# The print() function takes one or more arguments
#.      in this case the variable x
# It returns no value, but has the side effect of printing the 
# value of x to the screen
print(x)
## [1] 5

# compute the natural logarithm of x, store it in y, then print the result
# How do I know it is the natural log?
y = log(x)
print(y)
## [1] 1.609438

# functions can take multiple arguments, and arguments can be named
# here we change the base of the logarithm
log(x, base = 10)
## [1] 0.69897

Basic data structures: vectors

# The c() function stands for concatenate
# it groups items together into a vector
(five_numbers = c(3, 2, 8.6, 4, 9.75))
## [1] 3.00 2.00 8.60 4.00 9.75

# create a vector of type character
colours = c("red", "red", "blue", "green", "red")

# can also create sequences and repeats directly into a vector
(one_to_ten = 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
(count_by_threes = seq(0, 15, 3))
## [1]  0  3  6  9 12 15
rep(0, 3)
## [1] 0 0 0

Basic data structures: vectors


# Get part of a vector using []
five_numbers[3]
## [1] 8.6
five_numbers[3:5]
## [1] 8.60 4.00 9.75
five_numbers[c(1,4)]
## [1] 3 4

Basic data structures: vectors

# get vector length
length(five_numbers)
## [1] 5

# testing and coersion
x = 5.5
is.vector(x) ## even a single value is a vector in R
## [1] TRUE
is.integer(x)
## [1] FALSE
as.integer(x)
## [1] 5

Basic data structures: data frames

# Load a dataset named 'penguins' and 
# convert it to a data frame
# this dataset comes from a package, "palmerpenguins"
data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Working with data frames

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Working with data frames

# data frame dimensions
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
dim(penguins)
## [1] 344   8

Working with data frames

# accessing a variable by name
penguins$bill_length_mm

# accessing a variable by position
penguins[,1] # [,1] get every row in the first column
# accessing a variable by name, and subsetting rows
penguins[1:10,"bill_length_mm"] # another way to access by name, gets the first 10 entries
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0

Code Style

Working with data frames

# There is an NA in this vector
penguins[1:10,"bill_length_mm"] 
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0

# getting rid of NA values
# DANGER - this will erase the original penguins, make sure you mean to do this!
penguins = penguins[complete.cases(penguins),]

Working with data frames

The variable species is a factor, representing a categorical variable with a fixed set of levels.

str(penguins)
## 'data.frame':    333 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Useful functions for factors include levels and table.

levels(penguins$species)
## [1] "Adelie"    "Chinstrap" "Gentoo"
table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       146        68       119

Graphics in R

R currently has two dominant graphical engines

Base graphics

ggplot2

Histograms: Base graphics

# generate some random numbers from the lognormal distribution
my_var = rlnorm(1000, log(10), 0.5)
hist(my_var)

Histograms: Base graphics

The defaults are not very nice, so lets improve things

hist(my_var, main = "", xlab="Variable of interest", ylab = "Frequency", 
     breaks = 20, col="#999999", border=NA)

Colors

head(colors())
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"

Colors

Populations & Samples

Populations & Samples

Populations & Samples

Considerations when sampling:

Describing populations and samples

Summary statistics: location

The population mean (\(\mu\)) can be approximated with the sample mean:

\[ \mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]

mean(my_var)
## [1] 11.68838

Summary statistics: location

Other location statistics

The mean can be strongly influenced by outliers.

median(my_var)
## [1] 10.26268

Summary statistics: location

Other location statistics

The mean can be strongly influenced by outliers.

Summary statistics: location

Other location statistics

The mean can be strongly influenced by outliers.

# mode
mode(my_var) # wrong!
## [1] "numeric"

# we can use the histogram function to approximate the sample mode
# WARNING: changing the number of breaks can have a 
# large impact on the results
my_hist = hist(my_var, breaks = 30, plot = FALSE)

# the mids variable gives you the midpoint of each bin
# counts gives you the count each bin
# cbind shows them together in columns
cbind(bar_midpoint = my_hist$mids, count = my_hist$counts)
##       bar_midpoint count
##  [1,]            1     2
##  [2,]            3    31
##  [3,]            5   116
##  [4,]            7   161
##  [5,]            9   170
##  [6,]           11   147
##  [7,]           13    92
##  [8,]           15    84
##  [9,]           17    48
## [10,]           19    50
## [11,]           21    31
## [12,]           23    24
## [13,]           25    15
## [14,]           27     5
## [15,]           29    10
## [16,]           31     6
## [17,]           33     2
## [18,]           35     4
## [19,]           37     0
## [20,]           39     1
## [21,]           41     0
## [22,]           43     0
## [23,]           45     1

Summary statistics: location

Other location statistics

The mean can be strongly influenced by outliers.

# use which.max to find out which count index (which bar) is the largest
# then take that midpoint -- this is the mode
(mode_index = which.max(my_hist$counts))
## [1] 5
my_hist$mids[mode_index]
## [1] 9

Summary statistics: location

Other location statistics

Comparing variables

We can compare variables in a way that is location independent by centering (subtracting the mean)

c(min(my_var), max(my_var))
## [1]  1.670021 44.760331
range(my_var)
## [1]  1.670021 44.760331

quantile(my_var, 0.4)
##      40% 
## 9.036879

# Centre the variable
# note! arithmetic operators like `-` are vectorized!
my_var_ctr = my_var - mean(my_var)
mean(my_var_ctr)
## [1] 5.448975e-16

Summary statistics: dispersion (or scale)

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]

We can estimate \(\sigma^2\) using the sample variance:

\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]

It is convenient to talk about the scale of \(x\) in the same units as \(x\) itself, so we use the (population or sample) standard deviation:

\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]

Summary statistics: dispersion (or scale)

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]

We can estimate \(\sigma^2\) using the sample variance:

\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]

It is convenient to talk about the scale of \(x\) in the same units as \(x\) itself, so we use the (population or sample) standard deviation:

\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]

# Population variance -- biased if x is a sample!
# one-line version:
# sum((my_var - mean(my_var))^2)/length(my_var)

N = length(my_var)
sq_dev = (my_var - mean(my_var))^2
sum(sq_dev)/N
## [1] 36.66053

# These R functions always produce the sample variance and sd
var(my_var)
## [1] 36.69723
sd(my_var)
## [1] 6.057824

Summary statistics: dispersion (or scale)

Other dispersion statistics

# the range function gives you the min and max
# Take the difference to get the statistical range
diff(range(my_var))
## [1] 43.09031
max(my_var) - min(my_var)
## [1] 43.09031

IQR(my_var)
## [1] 7.181614
quantile(my_var, 0.75) - quantile(my_var, 0.25)
##      75% 
## 7.181614

# coefficient of variation can be done manually
sd(my_var)/mean(my_var)
## [1] 0.5182773

Summary statistics: higher moments

Asymmetry: skewness

Is the distribution weighted to one side or the other?

\[ \mathrm{skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)s^3} \]

Tailedness: kurtosis

How fat are the tails relative to a normal distribution?

\[ \mathrm{kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)s^4} \]

Another way to visualise a sample: boxplots

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

More on boxplots

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

More on boxplots

For more complex groupings, better to use ggplot

peng_bplot = ggplot(penguins, aes(y = body_mass_g, x = sex, 
                            fill = species)) + 
    geom_boxplot(notch = TRUE) + 
    theme_minimal() + ylab("Body Mass (g)") + xlab("") +
    scale_fill_brewer(palette = "Paired") 

peng_bplot

Programming mini-lesson: repeating yourself

# bad!
(col_means = c(
  mean(penguins[,3]),
  mean(penguins[,4]),
  mean(penguins[,5]),
  mean(penguins[,3])  # it's easy to introduce mistakes this way!
))
## [1]  43.99279  17.16486 200.96697  43.99279

Applying functions

# bad!
(col_means = c(
  mean(penguins[,3]),
  mean(penguins[,4]),
  mean(penguins[,5]),
  mean(penguins[,3])  # it's easy to introduce mistakes this way!
))
## [1]  43.99279  17.16486 200.96697  43.99279

# better
# first we find out which columns are numeric
numeric_columns = sapply(penguins, is.numeric)

# then we get the means of those columns
(col_means = sapply(penguins[,numeric_columns], mean))
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.99279          17.16486         200.96697        4207.05706 
##              year 
##        2008.04204

Applying functions

sapply(penguins[,numeric_columns], range)
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## [1,]           32.1          13.1               172        2700 2007
## [2,]           59.6          21.5               231        6300 2009
# here, we pass an additional argument named probs to quantile
# see ?quantile for what this does
sapply(penguins[,numeric_columns], quantile, probs = c(0.25, 0.5, 0.75))
##     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## 25%           39.5          15.6               190        3550 2007
## 50%           44.5          17.3               197        4050 2008
## 75%           48.6          18.7               213        4775 2009

Tabular applies

# with: use the variables found in the data.frame penguins
with(penguins, 
     # compute the mean of bill_length_mm for each of the three species
     tapply(bill_length_mm, species, mean)
)
##    Adelie Chinstrap    Gentoo 
##  38.82397  48.83382  47.56807

Probability distributions

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Exercise: Try to simulate this process in R. What does the distribution of locations look like after 10 steps? After 100? What is the long-run distribution with many steps and many players?

# if you want to draw the soccer pitch, you can read the function for it
source("r/football.r")

# 100 players, all on the centre line (value = 0)
nplayers = 100
players = rep(0, nplayers)

# define the size of the steps for each coin flip
heads = 0.5
tails = -0.5

# simulate one step for each player
steps = sample(c(heads, tails), nplayers, replace = TRUE)

# update player locations
players = players + steps

# visualise, if desired
draw_pitch(players)
hist(players)

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Exercise: Try to simulate this process in R. What does the distribution of locations look like after 10 steps? After 100? What is the long-run distribution with many steps and many players?

# simulate 500 steps with 1000 players
nplayers = 1000
players = rep(0, nplayers)

for(i in 1:500) {
    steps = sample(c(heads, tails), nplayers, replace = TRUE)
    players = players + steps
}

# visualise, if desired
draw_pitch(players)
hist(players)

Probability distributions: PDFs

hist(players, breaks=30, col="gray", main = "", freq=FALSE)

Probability distributions: PDFs

hist(players, breaks=30, col="gray", main = "", freq=FALSE)
lines(density(players, adjust=1.5), col='red', lwd=2)

Normal distributions

hist(players, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(players, adjust=1.5), col='red', lwd=2)
mu = mean(players)
sig = sd(players)
x_norm = seq(min(players), max(players), length.out = 400)
y_norm = dnorm(x_norm, mu, sig)
lines(x_norm, y_norm, lwd=2, col='blue')
legend("topright", legend=c(paste("sample mean =", round(mu, 2)), 
                            paste("sample sd =", round(sig, 2))), lwd=0, bty='n')

Normal distributions

Normal distributions

\[ \mathcal{f}(x) = \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} \]

Normal distributions: area under the curve

Normal distributions: area under the curve

Normal distributions: CDF

\[ \mathcal{g}(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} dx \]

Distributions in R

PDF: what is the probability density when \(x=3\) (the height of the bell curve)

dnorm(x = 3, mean = mean(players), sd = sd(players))
## [1] 0.03452526

Distributions in R

PDF: what is the probability density when \(x=3\) (the height of the bell curve)

dnorm(x = 3, mean = 0, sd = 1)
## [1] 0.004431848

CDF: what is the cumulative probability when \(x=q\)

(area under the bell curve from \(-\infty\) to \(q\))

(probability of observing a value < \(q\))

pnorm(q = mean(players) - 1.96 * sd(players), 
      mean = mean(players), sd = sd(players))
## [1] 0.0249979

Distributions in R

Quantiles: what is the value of \(x\), such that the probability of observing x or smaller is \(p\)

(inverse of the CDF/pnorm)

qnorm(p = 0.025, mean = 0, sd = 1)
## [1] -1.959964

Distributions in R

Quantiles: what is the value of \(x\), such that the probability of observing x or smaller is \(p\)

(inverse of the CDF/pnorm)

qnorm(p = 0.025, mean = 0, sd = 1)
## [1] -1.959964

RNG: Random number generator, produces \(n\) random numbers from the desired distribution

rnorm(n = 10, mean= 0, sd = 1)
##  [1] -1.7879356  1.6209792 -0.9772414  0.7117817 -0.9274279 -0.5008930
##  [7] -1.9062491 -0.1495901 -0.7830480  0.2145755

Sampling error

Standard error of the mean

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]

Central limit theorem