Helpful vocabulary

console: A window where you can type commands and view (text) results and output
interpreter: Software that translates R commands into instructions for your computer in real time
script: a text file containing a program, or series of commands
- can be run interactively (sending commands one at a time to the console)
- or in batch mode (all commands run, one after the other)
working directory: location on your computer where R will search for files, data, etc.

Any code that is given to you will assume your working directory is set to the unit you are working on. So, for today’s exercises, you should run the following in the console: setwd("unit_1").

Recommendations

Create an RStudio project to organize your work for the course (File => New Project).
Store all files in the project folder (your project will be self-contained).
Save related commands together in a script.
Annotate your code with comments. Write more comments than you need.
- In R, anything after the # character is a comment and will be ignored by the interpreter.

Variables

A variable is a name that points to some data.
- Variable names can contain lower- or upper-case letters, numbers, and the _ symbol.
- Names must start with letters and (when possible) should be descriptive
Variables are given values by assignment using either the = or <- symbol

Recommendations

Use descriptive variable names instead of comments.
Avoid 1- and 2- letter names.
Separate words with underscores.
Use a consistent assignment operator (= or <-, not both)

# Comments in R start with the # symbol
# anything after # will not be executed
# Comments are a useful way to annotate your code so you know what is 
# happening

# Legal variable names
x = 1
y0 = 5
time_of_day = "20:15"
dayOfWeek <- "Monday"

# bad!
# d is the diversity in our site, in species
d = 8

# better!
site_diversity = 8


# errors
0y = 5
my name = "Lauren"
## Error: <text>:21:2: unexpected symbol
## 20: # errors
## 21: 0y
##      ^

Data types

numeric: integers and floating-point (decimal) numbers
logical: yes/no, true/false data; in R represented by the special values TRUE and FALSE
- T and F (no quotes) can be used as shortcuts for TRUE/FALSE, but you should avoid this!
character: strings, text
factor: special variable type for categorical (nominal & ordinal) data

Operators

We use operators to perform computations on variables and constants

Assignment: =, <-
Math: +, -, *, /, ^, %%
Comparison: ==, !=, <, >, <=, >=

# assignment
x = 5

# math
x + 2
## [1] 7

(3 + x) * 2
## [1] 16

3^2
## [1] 9

10 %% 3
## [1] 1


# comparison
x < 7
## [1] TRUE

x == 5.1
## [1] FALSE

Logic

Logical comparisons can be made with and (&), or (|), and not (!).

# and: both must be true
TRUE & TRUE
## [1] TRUE

TRUE & FALSE
## [1] FALSE

FALSE & FALSE
## [1] FALSE


# or: only one needs to be true
TRUE | FALSE
## [1] TRUE

FALSE | FALSE
## [1] FALSE


# not: returns the opposite
!FALSE
## [1] TRUE

Logic

Logical comparisons can be made with and (&), or (|), and not (!).
You can use logic to help your script make decision based on the outcome of a computation

# making decisions
x = 2^15 + 1
if(x > 100) {
    print("x is big")
} else {
    print("x is small")
}
## [1] "x is big"


if(x > 100 & x %% 2 == 0) {
    print("x is big and even")
} else {
    print("x is either small or an odd number")
}
## [1] "x is either small or an odd number"

Functions

Functions take a series of arguments (which can be variables or constants).
The function performs computations on the arguments.
Functions can return a result, have side-effects, or both.

Mathematical functions

sin(), cos()
log(), exp(), sqrt()

Get help on a function with ? or help(), for example: ?log or help(log). If you don´t know a function´s name, you can search for a (likely/suspected) string in its name with ??.

x = 5

# The print() function takes one or more arguments
#.      in this case the variable x
# It returns no value, but has the side effect of printing the 
# value of x to the screen
print(x)
## [1] 5


# compute the natural logarithm of x, store it in y, then print the result
# How do I know it is the natural log?
y = log(x)
print(y)
## [1] 1.609438


# functions can take multiple arguments, and arguments can be named
# here we change the base of the logarithm
log(x, base = 10)
## [1] 0.69897

Basic data structures: vectors

A data structure is a way of organizing multiple pieces of data into a single variable
Can get very complex; we will start with the two most important types
A vector holds one or more values of a single data type
- Vectors can be created using the concatenate function c()

# The c() function stands for concatenate
# it groups items together into a vector
(five_numbers = c(3, 2, 8.6, 4, 9.75))
## [1] 3.00 2.00 8.60 4.00 9.75


# create a vector of type character
colours = c("red", "red", "blue", "green", "red")

# can also create sequences and repeats directly into a vector
(one_to_ten = 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10

(count_by_threes = seq(0, 15, 3))
## [1]  0  3  6  9 12 15

rep(0, 3)
## [1] 0 0 0

Basic data structures: vectors

A vector holds one or more values of a single data type
- Vectors can be created using the concatenate function c()
- You can subset vectors using []


# Get part of a vector using []
five_numbers[3]
## [1] 8.6

five_numbers[3:5]
## [1] 8.60 4.00 9.75

five_numbers[c(1,4)]
## [1] 3 4

Basic data structures: vectors

A vector holds one or more values of a single data type
- Vectors can be created using the concatenate function c()
- You can subset vectors using []
- Other helpful functions can let you examine vectors

# get vector length
length(five_numbers)
## [1] 5


# testing and coersion
x = 5.5
is.vector(x) ## even a single value is a vector in R
## [1] TRUE

is.integer(x)
## [1] FALSE

as.integer(x)
## [1] 5

Basic data structures: data frames

A data.frame holds tabular data
- Each row of a data frame is a single case (i.e., an observation)
- Each column of a data frame is a single variable (i.e., a vector, all the same data type)
head shows the first few rows of a data frame

# Load a dataset named 'penguins' and 
# convert it to a data frame
# this dataset comes from a package, "palmerpenguins"
data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
head(penguins)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

Working with data frames

View shows the data frame in a graphical window
str gives you a summary of the structure of the data

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Working with data frames

nrow, ncol, dim give you dimensions of the data frame

# data frame dimensions
nrow(penguins)
## [1] 344

ncol(penguins)
## [1] 8

dim(penguins)
## [1] 344   8

Working with data frames

Access an individual variable using $ or []

# accessing a variable by name
penguins$bill_length_mm

# accessing a variable by position
penguins[,1] # [,1] get every row in the first column

# accessing a variable by name, and subsetting rows
penguins[1:10,"bill_length_mm"] # another way to access by name, gets the first 10 entries
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0

Code Style

Here we show 3 ways of accessing a column ($bill_length_mm, [,1], [, 'bill_length_mm']).
Are any of these preferable to the others? Why?

Working with data frames

We saw on the last slide there are some NAs in the data, representing missing values.
Many analyses fail with NAs. We must either remove them or instruct our analysis how to deal with them.
Here we use complete.cases, which returns the rown numbers of all rows that do not contain any NAs.

# There is an NA in this vector
penguins[1:10,"bill_length_mm"] 
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0


# getting rid of NA values
# DANGER - this will erase the original penguins, make sure you mean to do this!
penguins = penguins[complete.cases(penguins),]

Working with data frames

The variable species is a factor, representing a categorical variable with a fixed set of levels.

str(penguins)
## 'data.frame':    333 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Useful functions for factors include levels and table.

levels(penguins$species)
## [1] "Adelie"    "Chinstrap" "Gentoo"

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       146        68       119

Graphics in R

R currently has two dominant graphical engines

Base graphics

Built in to R.
Simple to use
Relatively easy syntax
Plots usually need a lot of customization before they are finished

ggplot2

A package you must install.
More complex syntax
Produces publication-quality graphics with less customization needed.
Plots can be saved in a variable, with customization added as needed.

Histograms: Base graphics

The workhorse function is hist

# generate some random numbers from the lognormal distribution
my_var = rlnorm(1000, log(10), 0.5)
hist(my_var)

Histograms: Base graphics

The workhorse function is hist

The defaults are not very nice, so lets improve things

main = "" disables the title
xlab and ylab control axis labels
breaks controls the number of bins in the histogram
col sets the color of the bars
border sets the color of the borders (NA: no border)

hist(my_var, main = "", xlab="Variable of interest", ylab = "Frequency", 
     breaks = 20, col="#999999", border=NA)

Colors

R supports colors using the common HTML color coding: #RRGGBB
- RR, GG, BB are the amounts of red, green, and blue
- each ranges from 00 (none) to FF (most)
- Color pickers online help you translate a color in real life to a coded color
R also has 657 named colors: col = rosybrown
see the colors() function for the names

head(colors())
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"

Populations & Samples

A population is the unit about which we want to learn about.
- e.g., all the trees in a forest

Populations & Samples

A population is the unit about which we want to learn about.
- e.g., all the trees in a forest
A sample is a subset of a population that is observed with the goal of obtaining knowledge about a population.
- We use statistics to describe, analyse, or perform inference about a population from a sample.

Populations & Samples

Considerations when sampling:

Select, e.g., 7 trees at random
Samples must be taken carefully so that they are representative of the entire population!
Gold standard: sample sizes must be large and drawn randomly from the population
Experimental design: designing the sample such that it represents the desired population

Describing populations and samples

We can visualise the distribution of values within a population using a histogram

We use summary statistics to describe the values numerically.
In practise, we must estimate population statistics via sample statistics.
We can formally describe the shape of the distribution of values using special statistics called moments.

Summary statistics: location

First moment: Arithmetic mean.

The population mean ($\mu$) can be approximated with the sample mean:

\[ \mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]

mean(my_var)
## [1] 11.68838

Summary statistics: location

First moment: Arithmetic mean.

Other location statistics

The mean can be strongly influenced by outliers.

median: The 50% quantile (i.e., half the values of x are above the median, half below)

median(my_var)
## [1] 10.26268

Summary statistics: location

First moment: Arithmetic mean.

Other location statistics

The mean can be strongly influenced by outliers.

median: The 50% quantile (i.e., half the values of x are above the median, half below)
mode: the most common value, the “peak” of a distribution

Summary statistics: location

First moment: Arithmetic mean.

Other location statistics

The mean can be strongly influenced by outliers.

median: The 50% quantile (i.e., half the values of x are above the median, half below)
mode: the most common value, the “peak” of a distribution
How to compute the mode? Use hist!

# mode
mode(my_var) # wrong!
## [1] "numeric"


# we can use the histogram function to approximate the sample mode
# WARNING: changing the number of breaks can have a 
# large impact on the results
my_hist = hist(my_var, breaks = 30, plot = FALSE)

# the mids variable gives you the midpoint of each bin
# counts gives you the count each bin
# cbind shows them together in columns
cbind(bar_midpoint = my_hist$mids, count = my_hist$counts)
##       bar_midpoint count
##  [1,]            1     2
##  [2,]            3    31
##  [3,]            5   116
##  [4,]            7   161
##  [5,]            9   170
##  [6,]           11   147
##  [7,]           13    92
##  [8,]           15    84
##  [9,]           17    48
## [10,]           19    50
## [11,]           21    31
## [12,]           23    24
## [13,]           25    15
## [14,]           27     5
## [15,]           29    10
## [16,]           31     6
## [17,]           33     2
## [18,]           35     4
## [19,]           37     0
## [20,]           39     1
## [21,]           41     0
## [22,]           43     0
## [23,]           45     1

Summary statistics: location

First moment: Arithmetic mean.

Other location statistics

The mean can be strongly influenced by outliers.

median: The 50% quantile (i.e., half the values of x are above the median, half below)
mode: the most common value, the “peak” of a distribution
How to compute the mode? Use hist!

# use which.max to find out which count index (which bar) is the largest
# then take that midpoint -- this is the mode
(mode_index = which.max(my_hist$counts))
## [1] 5

my_hist$mids[mode_index]
## [1] 9

Summary statistics: location

First moment: Arithmetic mean.

Other location statistics

minimum, maximum values
quantiles: the p-th quantile of x is the value of x such that p% of values are beneath the quantile
- commonly used: (2.5%, 97.5%), 50%, (25%, 75%)

Comparing variables

We can compare variables in a way that is location independent by centering (subtracting the mean)

c(min(my_var), max(my_var))
## [1]  1.670021 44.760331

range(my_var)
## [1]  1.670021 44.760331


quantile(my_var, 0.4)
##      40% 
## 9.036879


# Centre the variable
# note! arithmetic operators like `-` are vectorized!
my_var_ctr = my_var - mean(my_var)
mean(my_var_ctr)
## [1] 5.448975e-16

Summary statistics: dispersion (or scale)

Second moment: Variance ($\sigma^2$)

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]

We can estimate $\sigma^2$ using the sample variance:

\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]

It is convenient to talk about the scale of $x$ in the same units as $x$ itself, so we use the (population or sample) standard deviation:

\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]

Summary statistics: dispersion (or scale)

Second moment: Variance ($\sigma^2$)

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]

We can estimate $\sigma^2$ using the sample variance:

\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]

It is convenient to talk about the scale of $x$ in the same units as $x$ itself, so we use the (population or sample) standard deviation:

\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]

# Population variance -- biased if x is a sample!
# one-line version:
# sum((my_var - mean(my_var))^2)/length(my_var)

N = length(my_var)
sq_dev = (my_var - mean(my_var))^2
sum(sq_dev)/N
## [1] 36.66053


# These R functions always produce the sample variance and sd
var(my_var)
## [1] 36.69723

sd(my_var)
## [1] 6.057824

Summary statistics: dispersion (or scale)

Other dispersion statistics

range: max(x) - min(x)
interquartile range (IQR): the difference between the third and first quartiles
coefficient of variation (CV): \[ \frac{s}{|\bar{x}|} \]

# the range function gives you the min and max
# Take the difference to get the statistical range
diff(range(my_var))
## [1] 43.09031

max(my_var) - min(my_var)
## [1] 43.09031


IQR(my_var)
## [1] 7.181614

quantile(my_var, 0.75) - quantile(my_var, 0.25)
##      75% 
## 7.181614


# coefficient of variation can be done manually
sd(my_var)/mean(my_var)
## [1] 0.5182773

Summary statistics: higher moments

Asymmetry: skewness

Is the distribution weighted to one side or the other?

\[ \mathrm{skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)s^3} \]

Tailedness: kurtosis

How fat are the tails relative to a normal distribution?

\[ \mathrm{kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)s^4} \]

Another way to visualise a sample: boxplots

Boxplots (sometimes box-and-whisker diagrams) summarize key statistics.
- median
- first and third quartiles (hinges)
- approx. 95% confidence interval for median (notch)
- min/max (or quartile + 1.5*IQR) (whiskers)
They are very useful for comparing variables among groups.
Base graphics: we use the boxplot function
y ~ group is a special data type called a formula

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

More on boxplots

Boxplots (sometimes box-and-whisker diagrams) summarize key statistics.
- median
- first and third quartiles (hinges)
- approx. 95% confidence interval for median (notch)
- min/max (or quartile + 1.5*IQR) (whiskers)
They are very useful for comparing variables among groups.
Base graphics: we use the boxplot function
y ~ group is a special data type called a formula

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

More on boxplots

Boxplots (sometimes box-and-whisker diagrams) summarize key statistics.
- median
- first and third quartiles (hinges)
- approx. 95% confidence interval for median (notch)
- min/max (or quartile + 1.5*IQR) (whiskers)
They are very useful for comparing variables among groups.
Base graphics: we use the boxplot function
y ~ group is a special data type called a formula

For more complex groupings, better to use ggplot

peng_bplot = ggplot(penguins, aes(y = body_mass_g, x = sex, 
                            fill = species)) + 
    geom_boxplot(notch = TRUE) + 
    theme_minimal() + ylab("Body Mass (g)") + xlab("") +
    scale_fill_brewer(palette = "Paired") 

peng_bplot

Programming mini-lesson: repeating yourself

An important programming concept is DRY: Don’t Repeat Yourself!
- Don’t copy and paste code if you can instead instruct the computer to repeat things for you.
There are many ways to execute the same code repeatedly; we will cover these in detail as they are needed.
A motivating example:
- Compute the mean of every numeric variable in the penguins dataset

# bad!
(col_means = c(
  mean(penguins[,3]),
  mean(penguins[,4]),
  mean(penguins[,5]),
  mean(penguins[,3])  # it's easy to introduce mistakes this way!
))
## [1]  43.99279  17.16486 200.96697  43.99279

Applying functions

We want to apply the function named mean separately to every variable in penguins
R has many different ways to apply functions, in this case we use simple apply (sapply)

# bad!
(col_means = c(
  mean(penguins[,3]),
  mean(penguins[,4]),
  mean(penguins[,5]),
  mean(penguins[,3])  # it's easy to introduce mistakes this way!
))
## [1]  43.99279  17.16486 200.96697  43.99279


# better
# first we find out which columns are numeric
numeric_columns = sapply(penguins, is.numeric)

# then we get the means of those columns
(col_means = sapply(penguins[,numeric_columns], mean))
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.99279          17.16486         200.96697        4207.05706 
##              year 
##        2008.04204

Applying functions

We want to apply the function named mean separately to every variable in penguins
R has many different ways to apply functions, in this case we use simple apply (sapply)
this works with any function that accepts a vector, even when they return more than a single value

sapply(penguins[,numeric_columns], range)
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## [1,]           32.1          13.1               172        2700 2007
## [2,]           59.6          21.5               231        6300 2009

# here, we pass an additional argument named probs to quantile
# see ?quantile for what this does
sapply(penguins[,numeric_columns], quantile, probs = c(0.25, 0.5, 0.75))
##     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## 25%           39.5          15.6               190        3550 2007
## 50%           44.5          17.3               197        4050 2008
## 75%           48.6          18.7               213        4775 2009

Tabular applies

There are three species of penguin; it makes more sense to have summary statistics for each one
tapply: tabular apply
- produces a summary table for a single variable (e.g., bill_length_mm) based on categories in another variable (e.g., species)

# with: use the variables found in the data.frame penguins
with(penguins, 
     # compute the mean of bill_length_mm for each of the three species
     tapply(bill_length_mm, species, mean)
)
##    Adelie Chinstrap    Gentoo 
##  38.82397  48.83382  47.56807

Probability distributions

We have shown several examples of curves we could describe a distribution of values.
What does it mean exactly to say that a variable follows a probability distribution?
How do these distributions arise in nature?

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Exercise: Try to simulate this process in R. What does the distribution of locations look like after 10 steps? After 100? What is the long-run distribution with many steps and many players?

# if you want to draw the soccer pitch, you can read the function for it
source("r/football.r")

# 100 players, all on the centre line (value = 0)
nplayers = 100
players = rep(0, nplayers)

# define the size of the steps for each coin flip
heads = 0.5
tails = -0.5

# simulate one step for each player
steps = sample(c(heads, tails), nplayers, replace = TRUE)

# update player locations
players = players + steps

# visualise, if desired
draw_pitch(players)
hist(players)

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.

Each person flips a coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

Exercise: Try to simulate this process in R. What does the distribution of locations look like after 10 steps? After 100? What is the long-run distribution with many steps and many players?

# simulate 500 steps with 1000 players
nplayers = 1000
players = rep(0, nplayers)

for(i in 1:500) {
    steps = sample(c(heads, tails), nplayers, replace = TRUE)
    players = players + steps
}

# visualise, if desired
draw_pitch(players)
hist(players)

Probability distributions: PDFs

We can rescale this histogram, such that the area of each bar represents the proportion of samples within that bin.

The height of each bar is the probability density.
The figure approximates an empirical probability density function.

hist(players, breaks=30, col="gray", main = "", freq=FALSE)

Probability distributions: PDFs

We can rescale this histogram, such that the area of each bar represents the proportion of samples within that bin.
The height of each bar is the probability density.
The figure approximates an empirical probability density function.
We can use the density function in R to add a curve approximating this density.

hist(players, breaks=30, col="gray", main = "", freq=FALSE)
lines(density(players, adjust=1.5), col='red', lwd=2)

Normal distributions

This looks a lot like a normal distribution.
As we take smaller and smaller steps, and do more coin flips, the distribution of values will converge exactly on a normal distribution.
Here we add a normal PDF based on the sample mean and sd (in blue).

hist(players, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(players, adjust=1.5), col='red', lwd=2)
mu = mean(players)
sig = sd(players)
x_norm = seq(min(players), max(players), length.out = 400)
y_norm = dnorm(x_norm, mu, sig)
lines(x_norm, y_norm, lwd=2, col='blue')
legend("topright", legend=c(paste("sample mean =", round(mu, 2)), 
                            paste("sample sd =", round(sig, 2))), lwd=0, bty='n')

Normal distributions

In nature, any process that involves summing small random values produces exactly this shape.
The distribution is symmetric and unimodal.
mean, median, mode are all equal.
The distribution has two parameters, the mean ($\mu$) and the standard deviation ($\sigma$).
If $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal.
Transforming a variable so that $\mu = 0$ and $\sigma = 1$ is called standardization. \[ x_{std} = \frac{(x - \bar{x})}{s}\approx\frac{(x - \mu)}{\sigma} \]
In R, we can use the scale function.

Normal distributions

The typical “bell-curve” shape of the normal distribution is the Normal Probability Density Function (PDF)

\[ \mathcal{f}(x) = \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} \]

It is easiest to understand what this function represents by understanding its integral.
Recall that the area of each bar in our histogram represents the proportion of observations in that bin.

Normal distributions: area under the curve

The integral between two values $a$ and $b$ is equal to the area under the curve.
This is exactly the probability of observing a value between $a$ and $b$.
Rule of thumb: about 68% of values in any normal distribution will be within 1 standard deviation of the mean.

Normal distributions: area under the curve

The integral between two values $a$ and $b$ is equal to the area under the curve.
This is exactly the probability of observing a value between $a$ and $b$.
Rule of thumb: about 68% of values in any normal distribution will be within 1 standard deviation of the mean.
Rule of thumb: 95% of values are $\mu \pm 1.96 \sigma \approx \mu \pm 2 \sigma$.

Normal distributions: CDF

If we integrate from $-\infty$ to some value $x$, then we have the probability of observing a value less than $x$.
This integral of the Normal PDF is known as the Normal cumulative distribution function.

\[ \mathcal{g}(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} dx \]

Distributions in R

R has four functions in common for many families of distributions:

PDF: what is the probability density when $x=3$ (the height of the bell curve)

dnorm(x = 3, mean = mean(players), sd = sd(players))
## [1] 0.03452526

Distributions in R

R has four functions in common for many families of distributions:

PDF: what is the probability density when $x=3$ (the height of the bell curve)

dnorm(x = 3, mean = 0, sd = 1)
## [1] 0.004431848

CDF: what is the cumulative probability when $x=q$

(area under the bell curve from $-\infty$ to $q$)

(probability of observing a value < $q$)

pnorm(q = mean(players) - 1.96 * sd(players), 
      mean = mean(players), sd = sd(players))
## [1] 0.0249979

Distributions in R

R has four functions in common for many families of distributions:

Quantiles: what is the value of $x$, such that the probability of observing x or smaller is $p$

(inverse of the CDF/pnorm)

qnorm(p = 0.025, mean = 0, sd = 1)
## [1] -1.959964

Distributions in R

R has four functions in common for many families of distributions:

Quantiles: what is the value of $x$, such that the probability of observing x or smaller is $p$

(inverse of the CDF/pnorm)

qnorm(p = 0.025, mean = 0, sd = 1)
## [1] -1.959964

RNG: Random number generator, produces $n$ random numbers from the desired distribution

rnorm(n = 10, mean= 0, sd = 1)
##  [1] -1.7879356  1.6209792 -0.9772414  0.7117817 -0.9274279 -0.5008930
##  [7] -1.9062491 -0.1495901 -0.7830480  0.2145755

Sampling error

In practise, we work with samples, which are imperfectly representative of the populations from which they are drawn.
The sample mean $(\bar{x})$ and sample standard deviation $(s)$ approximate the population $\mu$ and $\sigma$.
We would like to known how good our estimates are!
A standard error tells you how well a sample statistic represents the population parameter.
If you repeat your sampling many times, and compute a separate sample statistic each time (e.g., $\bar{x}$, $s$, etc)…
The sample statistics will converge on a normal distribution (if the sample size is large).
The standard deviation of sample statistics will equal the standard error of the statistic.

Standard error of the mean

In practise, we work with samples, which are imperfectly representative of the populations from which they are drawn
The sample mean $(\bar{x})$ and sample standard deviation $(s)$ approximate the population $\mu$ and $\sigma$
We would like to known how good our estimates are!
A standard error tells you how well a sample statistic represents the population parameter.
If you repeat your sampling many times, and compute a separate sample statistic each time (e.g., $\bar{x}$, $s$, etc)…
The sample statistics will converge on a normal distribution (if the sample size is large)
The standard deviation of sample statistics will equal the standard error of the statistic.
We commonly use the standard error of the mean to make inferences about the mean of a sample
A larger sample size results in a smaller standard error

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]

Central limit theorem

If you repeat your sampling many times, and compute a separate sample statistic each time (e.g., $\bar{x}$, $s$, etc)…
The sample statistics will converge on a normal distribution (if the sample size is large).

This holds generally in many natural systems.
If a variable $x$ is normally distributed, then $\bar{x}$ will be normally distributed, regardless of sample size.
The farther $x$ is away from a normal distribution, the larger $n$ must be before $\bar{x}$ follows a normal distribution.
If $n$ is large enough…
$\bar{x}$ will be normally distributed with a mean = $\mu$ and standard deviation = the standard error of the mean.

The Basics

Course Introduction

Course Introduction

Course structure

Website and course materials

Introduction to R

Helpful vocabulary

Recommendations

Variables

Recommendations

Data types

Operators

Logic

Logic

Functions

Basic data structures: vectors

Basic data structures: vectors

Basic data structures: vectors

Basic data structures: data frames

Working with data frames

Working with data frames

Working with data frames

Code Style

Working with data frames

Working with data frames

Graphics in R

Base graphics

ggplot2

Histograms: Base graphics

Histograms: Base graphics

Colors

Colors

Populations & Samples

Populations & Samples

Populations & Samples

Considerations when sampling:

Describing populations and samples

Summary statistics: location

Summary statistics: location

Other location statistics

Summary statistics: location

Other location statistics

Summary statistics: location

Other location statistics

Summary statistics: location

Other location statistics

Summary statistics: location

Other location statistics

Comparing variables

Summary statistics: dispersion (or scale)

Summary statistics: dispersion (or scale)

Summary statistics: dispersion (or scale)

Other dispersion statistics

Summary statistics: higher moments

Asymmetry: skewness

Tailedness: kurtosis

Another way to visualise a sample: boxplots

More on boxplots

More on boxplots

Programming mini-lesson: repeating yourself

Applying functions

Applying functions

Tabular applies

Probability distributions

Probability distributions

Probability distributions

Probability distributions

Probability distributions

Probability distributions: PDFs

Probability distributions: PDFs

Normal distributions

Normal distributions

Normal distributions

Normal distributions: area under the curve

Normal distributions: area under the curve

Normal distributions: CDF

Distributions in R

Distributions in R

Distributions in R

Distributions in R