wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/beausoleilmo/Rsimulation/main/workshopXX-En/scripts/workshopXX-en.r
Theory:
The exercises for this workshop: Challenge wallpaper, Challenge 1, Challenge 2, Challenge 3, Challenge 4, Challenge 5, Challenge 6, Challenge 7.
Supporting material:
Please check out the Simulation in R Cheat Sheet
This workshop was designed to give you the tools to perform simulations in R.
For this workshop, it is useful to have a solid understanding of:
See the QCBS R workshops if you want to revise these topics. Select the R workshops needed.
Ultimately, I would like the students to be able to:
Consider that it takes time to develop a programmer's intuition when coding. There are multiple ways to get to the same answer, and the idea is not to be perfect at first, but to try to answer a question and improve.
In this section, we will explore
What is a model? Can you give examples of models that are not statistical models?
Basically, our task is to take statistical clay and model it as we wish based on some idea. In this workshop, I want you to be able to play a bit more with statistical plasticine ("pâte à modeler": Plasticine, Play-Doh).
The goal is to make part of our world easier to understand, define, quantify, visualize, simulate, etc. See Scientific modelling.
Some examples
Biological models
Statistical models
See also Scientific modelling
Let's start with what simulations are.
"Models can help to guide a scientist's intuition about
But mathematical models also have their limitations. The results of an analysis are only as interesting as the biological questions motivating a model. [...]"
— Sarah P. Otto and Troy Day, 2007 (modified)
Models can be deterministic (full information: the future is entirely predicted) or stochastic (random events affect the system: predictions can only be probabilistic).
Citation: "Science will progress faster and further by a marriage of mathematical and empirical biology. This marriage will be even more successful if more biologists can use math, when needed, to advance their own research goals. It is toward this end that we devote this book." — Sarah. P. Otto, and Troy Day 2007
NOTE: Box's "All models are wrong, but some are useful": "The aphorism is generally attributed to the statistician George Box, although the underlying concept predates Box's writings." Box, George E. P. (1976), "Science and statistics", Journal of the American Statistical Association, 71 (356): 791–799, doi:10.1080/01621459.1976.10480949.
It would be unproductive to say that all maps are wrong and dismiss them because of that. They are still very useful!
"[...] the term 'simulation' describes a wealth of varied and useful techniques, all connected with the mimicking of the rules of a model of some kind."
— Byron J. T. Morgan 1984
Are these two distributions showing the same stochastic process?
Games:
... responsible and need to make conscious choices [...]
IMAGE: Are these two distributions showing the same random process?
Yes! They are both normally distributed
Étienne Low-Décarie says "it eliminates the need for dice when playing Dungeons and Dragons"
library(TeachingDemos)
dice(rolls = 3, ndice = 4, sides = 6, plot.it = TRUE, load = rep(1, 6))
You probably learned how to use statistical models on certain types of data.
Simulations help you:
Random: 'that which is governed by the laws of probability'.
Expected value:
Forest management: it could take a long time to collect data; simulations can help make faster decisions.
See the book: Morgan, B.J.T., 1984. Elements of Simulation.
We might start with a research question or simply want to understand a process.
X → Y (correlated)
Everything that we measure is caused by factors that scientists try to make sense of.
The question here is:
Random variables (from a stochastic process)
For power analysis...
power.prop.test()
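As a quick, hedged sketch (the proportions and power below are made-up illustration values), base R's power.prop.test() solves for whichever argument is left out:

# Sample size per group needed to detect a change from 50% to 75%
# with a 5% significance level and 80% power (illustrative values)
power.prop.test(p1 = 0.50, p2 = 0.75, sig.level = 0.05, power = 0.80)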
The list below was taken from Banks (1998, Handbook of Simulation).
Pros:
Cons:
They are offset by
You might be wondering:
Learn about science in general
Design a particular study
Publish an article or criticize
etc.
Here is a short list of situations in which you could use simulation:
Simulation is (Banks 1998)
You can now see with the simulation that
See also this example Online Box 5: R code to simulate drift, selection, and dispersal of two species in any number of patches: Figures 6.7-6.9 in the book
Here are 2 plots with an independent variable X and a response variable y.
Find correlation and covariance
What would you infer about correlation and covariance?
What can be seen?
set.seed(123)
n = 250
x.1 = rnorm(n, mean = 15, sd = 5)
y.1 = 2*x.1 + rnorm(n, mean = 0, sd = 4)  # rnorm(n, mean = 5, sd = 2)
x.2 = rnorm(n, mean = 15, sd = 1)
y.2 = 2*x.2 + rnorm(n, mean = 0, sd = .5) # rnorm(n, mean = 5, sd = .5)
Assumptions of linear model
Equality of variance (see residuals)
Distribution of x and y to better understand what could have PRODUCED these variables.
It might not appear very clear, but the data in
set.seed(1217)
n = 250
b0 = 0.8 ; b1 = 0.5 ; b2 = 0.5
x1 = runif(n, min = 0, max = 6)
lambda1 = exp(b0 + b1*x1)
y1 = rpois(n, lambda = lambda1)
x2 = rnorm(n, mean = 3, sd = 1)
lambda2 = exp(b0 + b2*x2)
y2 = rpois(n, lambda = lambda2)
expand.grid() and outer()
See "The Essence of R's expand.grid() Function".
Description section. Add a 'Description' section to all your simulation scripts to introduce what the script is about.
# Description ------------------------------------------------------------
#### ### ### ## #### ### ### ## #### ### ### ##
# TITLE OF THE SCRIPT
# Created by YOUR_NAME
#
# Why:
# Requires:
# NOTES:
# Reference :
#### ### ### ## #### ### ### ## #### ### ### ##

# Code here ...
Comments. Be extra generous when commenting your code, describing as precisely as possible the information that is not already evident from the variables.
# Description ------------------------------------------------------------
# This is the description section as previously presented

# Libraries ---------------------------------------------------------------
# Here load the libraries used in the script
library(ggplot2)

# Functions ---------------------------------------------------------------
# Add and describe the functions used in the script
## Add more information
function.name = function(var){tmp = 2 + var; return(tmp)}

# Plots -------------------------------------------------------------------
# Plotting the data simulated
## Add more information
plot(function.name(1:10) + rnorm(10))
(We used fewer comments to lighten the presentation, but please add comments.)
Also, don't forget to add sections in your script.
Alt + Shift + K gets you the shortcut list of RStudio.
Cmd + Shift + R automatically adds a section that will be visible in the document outline.
Time to run a simulation is important: knowing that some functions are faster in R can make the difference between a slow simulation and a fast one.
In R, use pdf(), png(), svg(), or any other device function to save your plots, and record your session information (with the sessionInfo() function).
ATTENTION. Usually, simulation code is written in small increments and then put together in a cohesive function that does what we are looking for.
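A minimal sketch of that saving workflow (the file name here is illustrative):

png("my_simulation.png", width = 800, height = 600) # open the graphics device
plot(rnorm(100)) # draw the plot into the file
dev.off() # close the device to write the file
sessionInfo() # record the R version and loaded packages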
In this section, we will discover functions that might help when coding simulations in R.
Function | Definition |
---|---|
set.seed | Control the Random Number Generation |
sample | Random Samples and Permutations |
rep | Replicate Elements of Vectors and Lists |
gl | Generate Factor Levels |
expand.grid | Create a Data Frame from All Combinations of Factor Variables |
replicate (see sapply) | Wrapper of sapply for repeated evaluation of an expression |
for, if, while, next, break | Control Flow |
p, d, q, r + 'name of dist.' | See the table of distributions in R |
plot, hist, points, lines, curve | Plotting functions |
R has the function set.seed(), which helps us control the RNG. The example below uses the RNG to extract a numerical value between 1 and 10.
runif(n = 1, min = 1, max = 10) # Gives a random number between 1 and 10
# [1] 3.588198
runif(n = 1, min = 1, max = 10) # RNG wasn't reset, different answer (see above)
# [1] 8.094746
runif(n = 1, min = 1, max = 10) # Different again...
# [1] 4.680792
set.seed(42); runif(n = 1, min = 1, max = 10) # This sets the RNG
# [1] 9.233254
set.seed(42); runif(n = 1, min = 1, max = 10) # The exact same number
# [1] 9.233254
See ??set.seed; e.g., set.seed(seed, kind = "Mersenne-Twister").
sample() randomly picks a value from a vector (i.e., random sampling). The example below uses the RNG to extract a numerical value between 1 and 10.
set.seed(12) # Set the RNG
v.1.10 = 1:10 # Make a vector from 1 to 10
# Randomly pick 1 (size) value from the vector (x), without replacement
sample(x = v.1.10, size = 1, replace = FALSE)
# [1] 2
set.seed(3) # Set the RNG
# Randomly pick 5 (size) letters from the vector (x), without replacement
sample(x = LETTERS, size = 5, replace = FALSE)
# [1] "E" "L" "G" "D" "H"
sample(x = as.factor(month.abb), size = 5, replace = FALSE)
# [1] Nov Aug Apr Jul Dec
# Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
sample() is a very versatile function that can be used in a variety of places. sample() can actually be used to do permutations.

set.seed(123)
n = 40; col = viridis::viridis(n = n)
x = 1:n ; y = 2 + .5*x + rnorm(n, sd = 7)
df.xy = data.frame(x, y, col)
set.seed(321)
df.xy$x.s = sample(df.xy$x)
df.xy$y.s = sample(df.xy$y)
# We break up the link of X and Y
Permutations:
Imagine a Rubik's cube. If you rotate one side, you get a permutation of the cube (a new arrangement).
for permutations
The highlighted lines are where the permutation occurs
?sample
Random Samples and Permutations
# From ?sample
x <- 1:12
sample(x)                 # a random permutation
sample(x, replace = TRUE) # bootstrap resampling, length(x) > 1
set.seed(1); sample(c(0,1), 100, replace = TRUE, prob = c(.5, .5)) # 100 Bernoulli trials
set.seed(1); rbinom(n = 100, size = 1, prob = .5)
permutate.df = replicate(n = 200, # nperm
                         expr = data.frame(y = sample(df.lm$y, size = nrow(df.lm), replace = FALSE),
                                           x = df.lm$x),
                         simplify = FALSE)
bootstrap.fun <- function(data) {
  tmp.sample = sample(1:nrow(data), size = c(nrow(data) - 20), replace = FALSE)
  data.frame(y = data[tmp.sample, "y"],
             x = data[tmp.sample, "x"])
} # end boot fun
bootstrap.df = replicate(n = 200,
                         expr = bootstrap.fun(df.lm),
                         simplify = FALSE)
set.seed(42)
# Get a sequence of dates
datae.seq = seq(from = as.Date('2010/01/01'), to = as.Date('2022/01/01'), by = "day")
# Look at beginning and end of sequence
head(datae.seq, 4); tail(datae.seq, 4)
# [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04"
# [1] "2021-12-29" "2021-12-30" "2021-12-31" "2022-01-01"
# Get only 5 elements of the generated sequence
sample(datae.seq, 5)
# [1] "2017-02-21" "2021-02-20" "2016-06-26" "2013-01-02" "2013-06-05"
set.seed(50)
p_dice = c(1,1,1,1,1,5) # Here we have 5 times the chance of landing on 6
# Same as writing p_dice/sum(p_dice) for the prob.
nb.tosses = 100
die_results <- sample(x = 1:6, # or seq(from = 1, to = 6, by = 1)
                      size = nb.tosses,
                      replace = TRUE,
                      prob = p_dice)
barplot(table(die_results), ylab = "Fq", xlab = "Face", main = "Loaded dice")
# table(die_results)/nb.tosses
You can repeat elements of a vector in R easily. The rep() function can help you with this.

(let4first = LETTERS[1:4])
# [1] "A" "B" "C" "D"
rep(let4first, times = 2) # Repeat the WHOLE sequence twice
# [1] "A" "B" "C" "D" "A" "B" "C" "D"
rep(let4first, each = 2) # Repeat each element twice
# [1] "A" "A" "B" "B" "C" "C" "D" "D"
# Set the number of repeats for each element in the vector
rep(let4first, times = c(1,2,3,6))
# [1] "A" "B" "B" "C" "C" "C" "D" "D" "D" "D" "D" "D"
# Complete replication: replicate each element twice and do this three times
rep(let4first, each = 2, times = 3)
# [1] "A" "A" "B" "B" "C" "C" "D" "D" "A" "A" "B" "B" "C" "C" "D" "D" "A" "A" "B"
# [20] "B" "C" "C" "D" "D"
rep(let4first, length.out = 6) # Repeat the vector until you hit the length.out
# [1] "A" "B" "C" "D" "A" "B"
rep()
R has a function that can help you generate factor levels (the gl function).

nb.of.levels = 2
nb.of.replicates = 8
labels.for.the.factors = c("Control", "Treat")
## First control, then treatment:
gl(n = nb.of.levels, k = nb.of.replicates, labels = labels.for.the.factors)
# [1] Control Control Control Control Control Control Control Control Treat
# [10] Treat Treat Treat Treat Treat Treat Treat
# Levels: Control Treat
## 20 alternating As and Bs
gl(n = 2, k = 1, length = 20, labels = LETTERS[1:2])
# [1] A B A B A B A B A B A B A B A B A B A B
# Levels: A B
## alternating pairs of 1s and 2s
gl(n = 2, k = 2, length = 19) # see last element missing
# [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2
# Levels: 1 2
gl() function
replicate() can be used instead of a for loop and can make simulations faster.

set.seed(1234)
data.replicated = replicate(n = 2,
                            expr = data.frame(gr = rep(LETTERS[1:3], each = 2),
                                              y = rnorm(6)),
                            simplify = FALSE)
data.replicated[[1]]
#   gr          y
# 1  A -1.2070657
# 2  A  0.2774292
# 3  B  1.0844412
# 4  B -2.3456977
# 5  C  0.4291247
# 6  C  0.5060559
data.replicated[[2]]
#   gr          y
# 1  A -0.5747400
# 2  A -0.5466319
# 3  B -0.5644520
# 4  B -0.8900378
# 5  C -0.4771927
# 6  C -0.9983864
replicate() is used to evaluate an expression n number of times and to record the output. See also apply, do.call, etc.
expand.grid() will make a data frame from all combinations of factor variables.

set.seed(1234)
exp.design = expand.grid(height = seq(60, 70, 10),
                         weight = seq(100, 200, 100),
                         sex = c("Male","Female"),
                         stringsAsFactors = TRUE) # characters will be factors
exp.design
#   height weight    sex
# 1     60    100   Male
# 2     70    100   Male
# 3     60    200   Male
# 4     70    200   Male
# 5     60    100 Female
# 6     70    100 Female
# 7     60    200 Female
# 8     70    200 Female
DT::datatable(exp.design, fillContainer=FALSE, options=list(pageLength = 5))
Use expand.grid() to make a deck of cards.

cards = c("A", 2:10, "J", "Q", "K"); length(cards) # Get card types
# [1] 13
suits = c("♦", "♣", "♥", "♠") # Get the suits c("Diamonds", "Clubs", "Hearts", "Spades")
cute.deck <- expand.grid(cards = cards, suits = suits)
cute.deck$nice.cards = apply(cute.deck, 1, function(x) paste0(x, collapse = "")) # Combine cards and suits
cute.deck$col = ifelse(cute.deck$suits %in% c("♦","♥"), "red", "black") # add colour
# Select cards at random
set.seed(1234)
n.c = 5
row.sel = sample(1:nrow(cute.deck), size = n.c, replace = FALSE)
cute.deck[row.sel, "nice.cards"]
# [1] "2♥" "3♣" "9♣" "J♥" "5♠"
# Probability of extracting 2 aces when drawing 3 cards from a deck of 52 cards that has 4 aces
nb.aces.in.deck = 4
nb.of.cards.in.deck = 52
draw.cards = 3
nb.cards.wish.to.check = 2
# This is modeled with a hypergeometric distribution (discrete), which calculates probabilities when sampling without replacement
dhyper(nb.cards.wish.to.check,
       m = nb.aces.in.deck,
       n = nb.of.cards.in.deck - nb.aces.in.deck,
       k = draw.cards)
# [1] 0.01303167
# Check prop. of each suit in a deck
table(cute.deck$suits)/nrow(cute.deck)
#
#    ♦    ♣    ♥    ♠
# 0.25 0.25 0.25 0.25
# cards <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
# Define suits, cards, values
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
cards <- c("Ace", 2:10, "Jack", "Queen", "King")
# Build deck, replicated proper number of times
deck <- expand.grid(cards = cards, suits = suits)
deck$value <- c(1, 2:9, rep(10, 4))
deck$col = ifelse(deck$suits %in% c("Diamonds","Hearts"), "red", "black")
outer() can apply a function to two input arrays.

x <- 1:10; names(x) <- x
# Multiplication & Power Tables
x %o% x # same as outer(x, x, "*")
x.mul = outer(x, x, "*")
x.sub = outer(x, x, "-")
x.add = outer(x, x, "+")
x.div = outer(x, x, "/")
Change rdm.nb and the seed to explore other palettes.

set.seed(9)
par(bg = NA, mar = c(0,0,0,0)) # will take the whole screen
x = 1:70; y = 1:70 # make variables
x.val = (x^2); y.val = y
xy.out = outer(x.val, y.val, "-") # manipulate the variables
nb.col.pal = length(grDevices::hcl.pals()) # find all palettes in hcl.pals
rdm.nb = sample(c(1:nb.col.pal), size = 1); print(rdm.nb) # Get random number
# [1] 59
xy.out = xy.out + rnorm(length(xy.out), mean = 0, sd = 200) # Prettify with random noise
image(t(xy.out), xaxt = "n", yaxt = "n", bty = "n",
      col = hcl.colors(1000, palette = hcl.pals()[rdm.nb])) # 61, 93 look good
png("~/Desktop/wallpapers_sunwaves.png", res = 300, units = "px",width = 1920, # add your screen resolution hereheight = 1080)# [wallpapercode here]dev.off()
Select randomly 4 values out of a vector of numbers ranging from 1 to 10: 1. without replacement and 2. with replacement. Use sample() to perform this task.
With sample(), students should explore the replace argument. Use the replicate function to generate a lot of draws from the sample function.

set.seed(12345) # Sets the random number generator to a fixed value
vec.1.10 = 1:10 # Make the vector to choose from
mean(vec.1.10)
# [1] 5.5
size = 10
n = 10
replicate.draws = replicate(n = n, simplify = TRUE,
                            expr = sample(x = vec.1.10, size = size, replace = TRUE))
get.average = apply(replicate.draws, 2, mean)
hist(get.average)
# Check out what simplify does and deal with it
replicate.draws.list = replicate(n = n, simplify = FALSE,
                                 expr = sample(x = vec.1.10, size = size, replace = TRUE))
# Make into a matrix
do.call(cbind, replicate.draws.list)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    2    5    4    1   10   10    1    9    5     5
#  [2,]    3    8    5    2    5    6    4    8   10     2
#  [3,]    8    3    5    9    3    8    8    8    5     5
#  [4,]    9    7    9    5    8    7   10    8    7     4
#  [5,]    9    8    4    7    6    7    2    9    4     4
#  [6,]    4    5   10    2    2    7    4    8   10     4
#  [7,]    8    4    7    5    1    8    3   10    2     8
#  [8,]    2    9   10    1    5   10    3    2    3     6
#  [9,]    4    8    2    7    1   10    8    4    8    10
# [10,]    5    8    2   10    6    2    9   10    2     2
do.call(rbind, replicate.draws.list)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    2    3    8    9    9    4    8    2    4     5
#  [2,]    5    8    3    7    8    5    4    9    8     8
#  [3,]    4    5    5    9    4   10    7   10    2     2
#  [4,]    1    2    9    5    7    2    5    1    7    10
#  [5,]   10    5    3    8    6    2    1    5    1     6
#  [6,]   10    6    8    7    7    7    8   10   10     2
#  [7,]    1    4    8   10    2    4    3    3    8     9
#  [8,]    9    8    8    8    9    8   10    2    4    10
#  [9,]    5   10    5    7    4   10    2    3    8     2
# [10,]    5    2    5    4    4    4    8    6   10     2
Select randomly 4 values out of a vector of numbers ranging from 1 to 10: 1. without replacement and 2. with replacement.
set.seed(12345) # Sets the random number generator to a fixed value
vec.1.10 = 1:10 # Make the vector to choose from
sample(x = vec.1.10, size = 4, replace = FALSE) # Sample 4 nb without replacement
# [1] 3 8 2 5
sample(x = vec.1.10, size = 4, replace = TRUE) # Sample 4 nb with replacement
# [1] 8 2 6 6
As you can see in the last example, there are 2 "6"s, since each time a random number was picked, all numbers could be randomly chosen.
1000 draws with replacement from 0 or 1, with equal probability for each.
set.seed(123); table(sample(x = 0:1, size = 1000, replace = TRUE, prob = c(.5, .5)))
#
#   0   1
# 493 507
Create a data frame with a deterministic linear variable (here the residual sd is effectively 0); make sure students are conscious about the sd they could change.

x = 1:10
y = 2 + 3 * x
gr = rep(letters[1:2], each = 5)
linear.df = data.frame(x, y, gr)
plot(y ~ x, col = as.factor(gr), data = linear.df, pch = 19, cex = 1.5)
Use MASS::fractions(.25) to convert a decimal to a fraction! See this explanation for prob vs odds.
Whitlock and Schluter: "The odds of success are the probability of success divided by the probability of failure"
log(1/6) = -1.7918 and log(6/1) = 1.7918
The odds ratio and the log odds ratio are a bit like the R-squared:
Also, the geometric mean (which is the back-transformed mean of the logs) is BETTER when using log-based data (e.g., qPCR, where the quantity doubles each round) and is less sensitive to outliers (to which the arithmetic mean is actually VERY sensitive).
raw.num = c(1,2,8) # numbers NOT on the log scale (simulate a DOUBLING process like qPCR)
mean(raw.num) # Arithmetic average (which would be WRONG)
# [1] 3.666667
mean.log = mean(log(raw.num)/log(2)) # Taking the mean of the log (this is the geometric mean)
mean.log # Mean, on the log scale, for the raw numbers transformed: log(raw.num)
# [1] 1.333333
2^mean.log
# [1] 2.519842
# gm = exp(mean(log(raw.num))) # calculating the geometric mean
gm = 2^(mean(log(raw.num)/log(2))) # Here we calculate the geometric mean with a base 2
# Now compare the geometric mean of the raw numbers with the mean of the log, transformed back into the raw number scale
gm
# [1] 2.519842
2^mean.log
# [1] 2.519842
tot.nb.ev = 100; success = 20
failure = tot.nb.ev - success
p = success/tot.nb.ev; q = failure/tot.nb.ev
(odds.favor = (success)/(failure)) # Odds in favor of the event (1 in 4), same as p/q
# [1] 0.25
(odds.agnst = (failure)/(success)) # Odds against the event (here 4 to 1), same as q/p
# [1] 4
The horse wins 1 race for every 4 it loses (odds of winning = 0.25). So, over 5 races it will win 1 and lose 4.
Conversely, the horse's odds of failing are 4:1 (read 4 to 1). The horse is 4 times more likely to fail than to succeed in a race. Odds can also be represented as ratios (and integer ratios).
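A small sketch of going back and forth between odds and probability (using the horse example above):

odds = 1/4          # odds of winning, 1:4
p = odds/(1 + odds) # probability of winning = odds/(1 + odds)
p
# [1] 0.2
p/(1 - p)           # back to odds
# [1] 0.25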
odds = 1 or log(odds) = 0: equal odds of winning and losing
odds < 1 or log(odds) < 0: more odds to lose
odds > 1 or log(odds) > 0: more odds to win
Another example: for a 6-sided die,
If the odds of success is 1 (O=1), we can say that "the odds are one to one" or "1:1", then one success occurs for every failure. If the odds are 10, (10:1) then 10 trials result in success for every one that results in failure (Whitlock and Schluter 2020)
In French, odds is the "cote" (https://fr.wikipedia.org/wiki/Cote_(probabilités)).
tot.nb.ev = 100; success = 20
failure = tot.nb.ev - success
(probability.favor = (success)/(tot.nb.ev)) # probability of event (here 20%); fractions(probability.favor)
# [1] 0.2
(q = failure/tot.nb.ev)
# [1] 0.8
# Get some LOG ODDS numbers
log_odds = seq(from = -5, to = 5, by = .25)
odds = exp(log_odds) # Transformed into odds
# Make inverse logit function
inv.logit = function(x) {exp(x)/(1 + exp(x))} # takes log odds as input. Same as function(x){1/(1 + exp(-x))}
p = inv.logit(log_odds) # This is p = odds/(1 + odds)
q = 1 - p # Probability of failure (1-p)
# Store log_odds and other data to plot
d = data.frame(log_odds, odds, p, q)
format(x = ht(d, 3), digits = 2, scientific = F) # ht() shows head and tail (helper from the workshop script)
#         log_odds     odds      p      q
# head.1      -5.0   0.0067 0.0067 0.9933
# head.2      -4.8   0.0087 0.0086 0.9914
# head.3      -4.5   0.0111 0.0110 0.9890
# tail.39      4.5  90.0171 0.9890 0.0110
# tail.40      4.8 115.5843 0.9914 0.0086
# tail.41      5.0 148.4132 0.9933 0.0067
We first generate a series of numbers on the LOG odds scale that we will transform to the probability scale (between 0 and 1).
proof for inv.logit2
What is the p-value 1. for getting a human height of 108.04 cm? 2. What is the probability of having a height between 109.9995 and 110.0005 cm? 3. What is the p-value for someone of height 110 cm? Below are the distributions of hypothetical heights with mean = 110 and sd = 1.

#         lb       ub
# 1 109.9995 110.0005
The answer is 1. 0.05 or 5%, 2. 0.04%, 3. 1 or 100%!
It is still very useful to know the different types of distributions that random variables can take.
Note that (in Frequentist inference statistics)
# [1] 13.9
# [1] 13.9 14.5 18.1 15.1 15.3
# [1] 13.9 14.5 18.1 15.1 15.3 18.4 15.9 12.5 13.6 14.1 17.4 15.7 15.8 15.2 13.9
# [16] 18.6 16.0 11.1 16.4 14.1 12.9 14.6 12.9 13.5 13.7 11.6 16.7 15.3 12.7 17.5
# [31] 15.9 14.4 16.8 16.8 16.6 16.4 16.1 14.9 14.4 14.2 13.6 14.6 12.5 19.3 17.4
# [46] 12.8 14.2 14.1 16.6 14.8 15.5 14.9 14.9 17.7 14.5 18.0 11.9 16.2 15.2 15.4
# [61] 15.8 14.0 14.3 13.0 12.9 15.6 15.9 15.1 16.8 19.1 14.0 10.4 17.0 13.6 13.6
# [76] 17.1 14.4 12.6 15.4 14.7 15.0 15.8 14.3 16.3 14.6 15.7 17.2 15.9 14.3 17.3
# [91] 17.0 16.1 15.5 13.7 17.7 13.8 19.4 18.1 14.5 12.9
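The code that produced these outputs is not in the extract; a plausible reconstruction (the mean of 15 and sd of 2 are guesses from the printed values, so treat this as hypothetical) would be 1, 5, and then 100 draws:

round(rnorm(n = 100, mean = 15, sd = 2), 1) # hypothetical: mean and sd inferred from the output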
In R, there is a convention on how to call different distributions.

Type | Definition | Comment |
---|---|---|
p | probability (cumulative distribution function or CDF; gives you the probability for a given quantile) | gives the probability that a random variable takes values in a certain range |
q | quantile (inverse of the CDF; gives you the x value for a given probability) | gives the position on the x axis where the area under the curve equals a certain probability |
d | density function (probability density function or PDF) | gives the probability density of the random variable at a specific value |
r | random variable coming from a certain distribution | we use this to generate random numbers from a specific distribution |
In R, there are many functions available to play with these distributions, or even to extract pseudo-randomly generated values and use them to perform simulations. Below are the d, p, q, and r functions for the normal distribution.
lower.tail = TRUE by default: the probability is calculated from the "left", whereas lower.tail = FALSE means that the probability will be calculated from the "right".

# probability of > 2 people with the same birthday
(a = pbirthday(23, coincident = 3))
# [1] 0.01441541
# 0.9 probability of >= 3 coincident birthdays
(a = qbirthday(coincident = 3, prob = 0.9)) # Gives the number of people
# [1] 135
# Chance of >= 4 coincident birthdays in 300 people
(a = pbirthday(300, coincident = 4))
# [1] 0.9740397
pbirthday (birthday paradox)

## from Diaconis & Mosteller p. 858. 'coincidence' is that husband, wife, daughter all born on the 16th
qbirthday(classes = 30, coincident = 3) # approximately 18
# [1] 18
coin = 2
n.seq = 1:qbirthday(p = 0.999, coincident = coin)
get.probs = mapply(pbirthday, n = n.seq, classes = 365, coincident = coin)
plot(get.probs ~ n.seq, type = "l"); abline(h = 0.5, v = qbirthday(p = 0.5, coincident = coin), lty = 3)
# birthday paradox
# scaling exponents in mathematics
n.people = 23
day.of.year = sample(1:365, n.people, replace = TRUE)
day.of.year
# [1] 82 111 164 315 237 8 144 15 264 54 80 100 150 221 330 141 61 139 193
# [20] 265 236 144 52
happy.birthday = function(n.people = 20) {
  day.of.year = sample(1:365, n.people, replace = TRUE)
  length(which(table(day.of.year) > 2))
}
barplot(table(replicate(10, happy.birthday(n.people = 40))))
pbirthday(40, classes = 365)
# [1] 0.8912318
prod(c(1 - (1:182/365)))
# [1] 4.780018e-25
hbirthd <- function(n.p = 5) {
  1 - prod(365:c(365 - c(n.p - 1)))/365^n.p
}
n.p = 2:65
plot(x = n.p, mapply(hbirthd, n.p = n.p), type = "l"); abline(h = 0.5, v = qbirthday(), lty = 3)
Distribution | Probability | Quantile | Density | Random | Parameters | PDF or PMF |
---|---|---|---|---|---|---|
Beta | pbeta | qbeta | dbeta | rbeta | shape1, shape2 | f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, 0 \lt x \lt 1 |
Binomial | pbinom | qbinom | dbinom | rbinom | sample size, prob. | f(x) = {n \choose x} p^x(1-p)^{n-x}, x = 0,1,2,...,n |
Cauchy | pcauchy | qcauchy | dcauchy | rcauchy | location, scale | f(x) = \frac{1}{\pi} \frac{1}{x^2+1}, -\infty \lt x \lt \infty |
Chi-Square | pchisq | qchisq | dchisq | rchisq | df | f(x) = x^{(k/2)-1} \frac{e^{-x/2}}{2^{k/2} \Gamma(k/2)}, x \gt 0 |
Exponential | pexp | qexp | dexp | rexp | rate | f(x) = \lambda e^{-\lambda x} (when x \geq 0, otherwise 0) |
F (Fisher's F) | pf | qf | df | rf | df1, df2 | f(x) = \frac{\Gamma[(r_1+r_2)/2](r_1/r_2)^{r_1/2}}{\Gamma(r_1/2)\Gamma(r_2/2)}\frac{x^{r_1/2-1}}{(1+r_1x/r_2)^{(r_1+r_2)/2}}, x \ge 0 |
Gamma | pgamma | qgamma | dgamma | rgamma | shape | f(x) = \frac{\lambda^c x^{c-1}e^{-\lambda x}}{\Gamma(c)}, x \gt 0 |
Geometric | pgeom | qgeom | dgeom | rgeom | prob. | f(x) = p(1-p)^{x}, x = 0,1,2,... |
Hypergeometric | phyper | qhyper | dhyper | rhyper | m, n, k | f(x) = \frac{{N-D \choose n-x}{D \choose x}}{{N \choose n}}, x = 0,1,2,...,n |
Logistic | plogis | qlogis | dlogis | rlogis | location, scale | f(x) = \frac{e^{a+bx}}{1+e^{a+bx}}, -\infty \lt x \lt \infty |
Log Normal | plnorm | qlnorm | dlnorm | rlnorm | mean, sd | f(x) = \frac{1}{x\sigma\sqrt{2\pi}}e^{-\Bigl(\frac{(\ln(x)-\mu)^2}{2\sigma^2}\Bigr)} |
Negative Binomial | pnbinom | qnbinom | dnbinom | rnbinom | size, prob. | f(x) = {x+r-1 \choose r-1}p^r(1-p)^x, x = 0,1,2,... |
Normal | pnorm | qnorm | dnorm | rnorm | mean, sd | f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\Bigl(\frac{x-\mu}{\sigma}\Bigr)^2}, -\infty \lt x \lt \infty |
Poisson | ppois | qpois | dpois | rpois | mean | f(x) = e^{-\lambda}\frac{\lambda^x}{x!}, x = 0,1,2,... |
t (Student's t) | pt | qt | dt | rt | df | f(x) = \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\Gamma(r/2)}\frac{1}{(1+x^2/r)^{(r+1)/2}} |
Studentized Range | ptukey | qtukey | dtukey | rtukey | | You just don't want to see it... |
Uniform | punif | qunif | dunif | runif | min, max | f(x) = \frac{1}{b-a} for x \in [a,b], 0 otherwise |
Weibull | pweibull | qweibull | dweibull | rweibull | shape | f(x) = \frac{k}{\alpha}\Bigl(\frac{x-\mu}{\alpha}\Bigr)^{k-1}e^{-((x-\mu)/\alpha)^k}, x \geq \mu; k, \alpha \gt 0 |
Distribution | Probability | Quantile | Density | Random | Parameters | PDF or PMF |
---|---|---|---|---|---|---|
Bernoulli | | | | | | f(x) = p^x(1-p)^{1-x}, x = 0,1 |
Wilcoxon Rank Sum Statistic | pwilcox | qwilcox | dwilcox | rwilcox | m, n | |
Wilcoxon Signed Rank Statistic | psignrank | qsignrank | dsignrank | rsignrank | see here | |
A distribution can be discrete (taking values such as 0, 5, 19, ...) or continuous (taking values such as 0.1, 0.62113, ...).

Distribution | Definition | Data type | Example |
---|---|---|---|
Binomial | probability of n repeated yes/no experiments with probability p of success | discrete; yes/no; 0/1 | number of heads after 3 tosses |
Poisson | probability of a number of cases happening in a defined range of time or space given a constant rate | discrete | ticks of a radiation counter, DNA mutations, colonization of animals on remote islands (something per unit of time or space, for example) |
Logistic | probability of an individual being in one class or another (link function in GLM) | continuous | alive or not, yes or no, diseased or not, etc. |
Normal | probability function which informs how values are distributed, centered on a mean, symmetric, and having a specific variance | continuous | height of individuals, beak length, shoe size, time required to run a kilometre; the data show a 'bell-shaped' curve |
Uniform continuous | the probabilities of all outcomes are equal | continuous | angle between 0 and 360° |
Uniform discrete | the probabilities of all outcomes are equal | discrete | faces of a die |
Exponential | the time it takes, starting from now, before something happens | continuous | used in maximum likelihood, generating genealogies, continuous rates |
This figure contains a decision graph that could help figure out what distribution you are facing. BUT, it doesn't guarantee that you'll find the right distribution.
Careful!
The image below shows a small roadmap demonstrating the link between different statistical distributions.
These distributions are available in base R. Some distributions can be simulated in R even if they don't have their own function name (e.g., Bernoulli, which is a binomial with size = 1).
It assumes that: a fixed number of trials is performed; each trial has only two possible outcomes (success/failure); and the trials are independent (see also the constant-probability condition below).
The probability mass function is:
f(x) = {n \choose x} p^x(1-p)^{n-x}, where n is the number of trials, x the number of successes, and p the probability of success.
The probability of success or failure DOES NOT change as the experiment goes: e.g., the probability of heads or tails for a coin should not change as the experiment is conducted.
See the lecture Binomial Distribution
choose() is used in R to find the number of combinations (regardless of the order, meaning that an item could be found in any position). For example, if you just want to know the number of combinations you can have from choosing 6 numbers out of 49, it is 1.3983816 \times 10^{7}. If you wish to know the chance of winning such a lottery, it's 7.1511238 \times 10^{-8}. choose(8,3) gives 56; factorial(4) in this case gives 24.

par(mfrow = c(1,2))
x <- 0:6
plot(x, factorial(x), type = "s", # step option
     log = "y", # Get logarithmic scale for y
     main = "factorial x")
abline(h = seq(0, 1000, by = 25), lty = 3)
x = 0:8
plot(x, choose(8, x), type = "s", main = "Binomial coefficients")
abline(h = seq(0, 70, by = 10), lty = 3)
In R, you can generate binomially distributed variables with the rbinom function:

rbinom(n = number of repetitions, # or number of 'experiments' or observations
       size = sample size, # successes range between 0 and size (0:size)
       prob = probability of success)
set.seed(12345)
# 1000 experiments where each time, e.g., flipping a coin, I can either have a
table(rbinom(n = 1000, size = 1, prob = 0.5), dnn = NULL) # success or failure with p=.5
#   0   1
# 469 531
# 1 experiment with 1000 coins summing all successes with p=.5
rbinom(n = 1, size = 1000, prob = 0.5)
# [1] 476
# 1000 experiments where each time, for example flipping 10 coins, where I
rbind(random = table(rbinom(n = 1000, size = 10, prob = 0.5), dnn = NULL)/1000, # sum the successes with p=.5
      theory = round(dbinom(x = 0:10, size = 10, prob = 0.5), 3))
#            0     1     2     3     4     5     6     7     8     9    10
# random 0.001 0.009 0.045 0.105 0.191 0.260 0.211 0.107 0.058 0.012 0.001
# theory 0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001
In R notation:

dbinom(x = 6, size = 7, prob = .5)
# [1] 0.0546875
dbinom(x = 3, size = 5, prob = .5)
# [1] 0.3125
Genotype example: the Punnett square for the male and female union

Gametes | A | a |
---|---|---|
A | AA | Aa |
a | Aa | aa |
Daughter examples
dbinom(2, 3, prob = .5)
# [1] 0.375
000 # 0 ## 1/8     1/8
001 # 1 ## 1/8 #
010 # 1 ## 1/8 #   3/8
100 # 1 ## 1/8 #
011 # 2 ## 1/8 ##
110 # 2 ## 1/8 ##  3/8
101 # 2 ## 1/8 ##
111 # 3 ## 1/8 ### 1/8
plot(0:8, dbinom(x = 0:8, size = 7, prob = 0.5), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, .4), ylab = "Probability", main = "s=7 (blue) and s=5 (red); p=0.5"); add.l(.05)
points(x = 6, y = dbinom(x = 6, size = 7, prob = 0.5), pch = 19, col = col.b)
lines(c(0:8) + 0.1, dbinom(x = 0:8, size = 5, prob = 0.5), type = 'h', col = col.r, lwd = lwd)
points(x = 3 + .1, y = dbinom(x = 3, size = 5, prob = 0.5), pch = 19, col = col.r)
legend("topright", legend = c("Genetics", "Daughter"), lty = c(1), col = c("blue", "red"), lwd = 3)
# For dbinom, x is the vector of quantiles
plot(0:1, dbinom(x = 0:1, size = 1, prob = 0.5), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, 1), ylab = "Probability", main = "s=1, p=.5"); add.l()
plot(0:20, dbinom(x = 0:20, size = 20, prob = 0.4), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, .2), ylab = "", main = "s=20, p=.4"); add.l()
plot(0:23, dbinom(x = 0:23, size = 20, prob = 0.9), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, 0.3), ylab = "", main = "s=20, p=.9"); add.l()
plot(40:110, dbinom(x = 40:110, size = 150, prob = 0.5), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, .15), ylab = "Probability", main = "s=150, p=.5"); add.l()
plot(40:51, dbinom(x = 40:51, size = 50, prob = 0.99), type = 'h', col = col.b, lwd = lwd,
     ylim = c(0, .8), ylab = "", main = "s=50, p=.99"); add.l()
hist(rbinom(n = 100000, size = 50, prob = 0.99), xlab = "Nb of successes", breaks = 100,
     col = col.b, main = "s=50, p=.99", probability = F)
hist(rbinom(n = 100000, size=50, prob=0.5), xlab = "Nb of successes", breaks = 100, col = col.b, main = "s=50, p=.5", probability = F)
What is the probability (in %) of getting 4 heads or 4 tails from 4 tosses of a fair coin? Use the binomial distribution to answer this question.
dbinom(x = 0, size = 4, prob = .5) + dbinom(x = 4, size = 4, prob = .5)
# [1] 0.125
curve(dbinom(x, 4, .5), 0, 6, n = 7, type = "h", lwd = lwd)
curve(dbinom(x, 4, .5), 0, 0, type = "h", add = TRUE, col = "red", lwd = lwd)
curve(dbinom(x, 4, .5), 4, 4, type = "h", add = TRUE, col = "red", lwd = lwd)
mybin <- function(x, n, p) {
  choose(n, x) * p^x * (1-p)^(n-x)
}
mybin(0, 4, .5) + mybin(4, 4, .5)
# [1] 0.125
0.5^4 * 2
# [1] 0.125
To be Poisson distributed, the data generally assumes that
The probability mass function is:
f(x) = e^{-\lambda}\frac{\lambda^x}{x!} where \mu = \sigma^2 = \lambda
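A quick simulation check of the \mu = \sigma^2 = \lambda property (the lambda value here is chosen arbitrarily):

set.seed(1)
x = rpois(n = 1e5, lambda = 4)
mean(x); var(x) # both should be close to lambda = 4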
In R, you can generate Poisson distributed variables with the rpois function. Below is the structure of the function:

rpois(n = number of repetitions,
      lambda = rate) # doesn't need to be an integer
set.seed(5937); rpois(n = 1, lambda = 2)
# [1] 5
add.l <- function(by = .1) { abline(h = seq(0, 1, by = by), lty = 3, lwd = .3) }
plot(0:5, dpois(x = 0:5, lambda = 0.05), type = 'h', col = col.b, lwd = lwd, ylim = c(0, 1),
     ylab = "Probability", main = "lambda = 0.05"); add.l(); abline(v = 0.05, lty = 3)
plot(0:10, dpois(x = 0:10, lambda = 1), type = 'h', col = col.b, lwd = lwd, ylim = c(0, 1),
     ylab = "Probability", main = "lambda = 1"); add.l(); abline(v = 1, lty = 3)
plot(0:20, dpois(x = 0:20, lambda = 5), type = 'h', col = col.b, lwd = lwd, ylim = c(0, 0.3),
     ylab = "Probability", main = "lambda = 5"); add.l(); abline(v = 5, lty = 3)
plot(0:40, dpois(x = 0:40, lambda = 20), type = 'h', col = col.b, lwd = lwd, ylim = c(0, .2),
     ylab = "Probability", main = "lambda = 20"); add.l(); abline(v = 20, lty = 3)
hist(rpois(n = 1000, lambda = 25), xlab = "Nb of successes", breaks = 100, col = col.b,
     main = "n = 1000, lambda = 25", probability = F); abline(v = 25, lty = 3)
hist(rpois(n = 100000, lambda = 50), xlab = "Nb of successes", breaks = 100, col = col.b,
     main = "n = 100000, lambda = 50", probability = F); abline(v = 50, lty = 3)
The last 2 graphs show pseudo-randomly generated numbers from a Poisson distribution.
How would you calculate the probability of getting X events or more from these graphs? (Check the ppois() function, as it does just that.)
# [1] 0.6766764
# [1] 0.6766764
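A hedged sketch of the idea (the lambda and cutoff here are illustrative values, not the ones behind the outputs above):

lambda = 5
ppois(q = 4, lambda = lambda, lower.tail = FALSE) # P(X > 4), i.e., P(X >= 5)
sum(dpois(x = 5:1000, lambda = lambda))           # same probability by summing the PMF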
# Define uniform discrete
dunifdisc <- function(x, min = 0, max = 1) ifelse(x >= min & x <= max & round(x) == x, 1/(max - min + 1), 0)
runifdisc <- function(n, min = 0, max = 1) sample(min:max, n, replace = TRUE)
curve(dunifdisc(x, 7, 10), type = "h", from = 6, to = 11, col = "black", xlab = "", lwd = 1,
      ylim = c(0, .5), ylab = "Probability", main = "Uniform discrete")
title(xlab = "x", line = 2.2, cex.lab = 1.2)
set.seed(12345)
sample(7:10, size = 1, replace = TRUE)
# [1] 8
runifdisc(1, 7, 10)
# [1] 9
x = seq(0, 360, by = .1)
y <- dunif(x, min = min(x), max = max(x))
plot(y ~ x, type = "n", ylim = c(0, 1.5*max(y)), ylab = "Density", xlab = "", main = "Uniform continuous")
polygon(c(x, rev(x), 0), c(y, rep(0, length(y)), 0), col = scales::alpha("blue", .5))
title(xlab = "x", line = 2.2, cex.lab = 1.2)
set.seed(12345)
runif(n = 1, min = 0, max = 360)
# [1] 259.5254
The general form of the probability density function for a normal distribution is
f(x)={\frac{1}{\sigma\sqrt{2\pi}}}e^{-{\frac{1}{2}}\Bigl(\frac {x-\mu}{\sigma}\Bigr)^2}
The important parts here are the mean \mu and the standard deviation \sigma, usually summarized as N(\mu, \sigma). However, note that the second parameter is sometimes given as the variance \sigma^2; in R, it is the standard deviation.
Here is a link to play with a Normal distribution interactively.
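To check that R's dnorm() indeed takes the standard deviation (the values of mu, sigma, and x here are arbitrary):

mu = 2; sigma = 3; x = 1
dnorm(x, mean = mu, sd = sigma)
1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)^2) # same value, from the PDF above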
# https://www.youtube.com/watch?v=4PDoT7jtxmw&ab_channel=3Blue1Brown
x = seq(-10, 10, by = .1)
s = 2 # ~ related to SD
plot(x, exp(-s*x^2), type = "l")
m = 0 # mean
plot(x, exp(-(1/s^2)*(x - m)^2), type = "l")
The iris dataset contains values that are 'drawn' from a normal distribution.

curve(expr = dnorm(x = x, mean = 0, sd = 1), from = -5, to = 5, ylim = c(0, 1),
      col = "black", ylab = "Density", main = "Density normal")
abline(h = seq(0, 1, by = .1), lty = 3, lwd = .3)
Using dnorm, we are going to show some properties of the normal distribution.
For the density distribution, the area under the curve between two points is the probability. For example, the 95% probability (red region), and 2 tails 2.5% (blue region).
Using dnorm, we are going to show some properties of the normal distribution. The y axis is the DENSITY of finding the values. It can RANGE from 0 to large values (not necessarily up to 1). The important point is that the TOTAL AREA UNDER THE CURVE is 1, so that a density curve can represent a probability.
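You can verify both points numerically (the small sd of 0.1 is chosen on purpose to push the density above 1):

dnorm(0, mean = 0, sd = 0.1) # density of about 3.99, greater than 1
integrate(dnorm, lower = -Inf, upper = Inf, mean = 0, sd = 0.1) # total area is still 1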
In this plot, there is the same standard normal curve, but some important values are plotted as well
qnorm(.025) = -1.96
qnorm(1-.025) = 1.96
pnorm(qnorm(1-.025)) = 0.975
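These values can be reproduced directly in R, since pnorm() and qnorm() are inverses of each other:

qnorm(.025)            # [1] -1.959964
qnorm(1 - .025)        # [1] 1.959964
pnorm(qnorm(1 - .025)) # [1] 0.975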
If X is a random variable with a standard normal distribution, what is the probability of finding X less than or equal to 1.645? In other words, P(X \leq 1.645) where X \sim N(0,1)?

# Gives the probability at X
pnorm(1.645)
# [1] 0.9500151
# Gives X at a prob.
qnorm(p = 0.05, lower.tail = F)
# [1] 1.644854
# Value for 5% in both tails
qnorm(p = 0.025, lower.tail = F)
# [1] 1.959964

lower.tail = T means that we are calculating the probability (area under the curve) from the left of the graph up to the specified value; lower.tail = F means that the probability is calculated from the right of the graph down to the specified value. The lower.tail argument is a logical: if TRUE (default), probabilities are P[X \leq x], otherwise P[X \gt x]. Here lower.tail = F since I wanted to show the TOP probability (from the UPPER tail).
This is the visual representation of what is on the left for the 5% probability. The value of x between the blue and red polygons is ~1.64.
# The function is not shown, but can be found in the script (Markdown)
draw.normal(where = "both",  prob = 0.05/2) # 2.5% in each tail
draw.normal(where = "both",  prob = 0.2/2 ) # 10% in each tail
draw.normal(where = "both",  prob = 0.5/2 ) # 25% in each tail
draw.normal(where = "both",  prob = 0.95/2) # 47.5% in each tail
draw.normal(where = "left",  prob = 0.05  ) # 5% only in the left tail
draw.normal(where = "right", prob = 0.05  ) # 5% only in the right tail
Probabilities (pnorm): get the probability of finding data relative to a number of standard deviations.

sd = 1
probability.left.side = (pnorm(q = c(sd*1, sd*2, sd*3), lower.tail = F))
probability.right.side = (pnorm(q = c(sd*1, sd*2, sd*3), lower.tail = T))
percent.data.under.curve = probability.right.side - probability.left.side
p.from.mean = round(percent.data.under.curve*100, 2)
Quantiles (qnorm): find the "x" value (quantile) based on the probability (area under the curve between 2 values) of a standard normal distribution.

qnorm(p = c(.75, .95, .975, .995), mean = 0, sd = 1, lower.tail = F) # notice the "lower.tail"
# [1] -0.6744898 -1.6448536 -1.9599640 -2.5758293
qnorm(p = c(.75, .95, .975, .995), mean = 0, sd = 1, lower.tail = T)
# [1] 0.6744898 1.6448536 1.9599640 2.5758293
qnorm(p = .75, mean = 0, sd = 1) gives the value "x" at which 75% of the data (area under the normal curve) is below that value; with lower.tail = F, use the complement (qnorm(p = 1 - c(.75, .95, .975, .995), mean = 0, sd = 1, lower.tail = F)) to get the same values.
)What is shown here is the OVERLAP of the 2 distributions
It is NOT a problem if the y value in the probability density function (PDF) is GREATER than one.
How could we improve our CONFIDENCE in telling that there is a REAL difference between our sample and the null hypothesis?
After increasing the sample size, you can see that the AREAS under the sample distribution have shifted, giving better POWER and reducing the type II error.
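A minimal power-by-simulation sketch (the effect size of 0.5 sd, the n, and the alpha are assumed illustration values, not from the workshop figures):

set.seed(42)
n = 30; n.sim = 1000
# Proportion of simulated experiments in which a one-sample t-test detects a true mean of 0.5
p.vals = replicate(n.sim, t.test(rnorm(n, mean = 0.5, sd = 1), mu = 0)$p.value)
mean(p.vals < 0.05) # estimated power; increase n and watch it rise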
Let's say you want to simulate the length of different beak sizes for 10 birds.
Use rnorm()
to generate 10 random normal numbers.
set.seed(1234)
n <- 10
rnorm(n)
# [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
# [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378
set.seed(1234)
n <- 10
rnorm(n) + 15
What is the probability of colonizing an island if a species has a 1 in a million (1/1000000) chance of colonizing it and we wait 1000000 (1 million) years?
Hint: think of colonization as a success.
Discussion points
This is an easy challenge that can be made in a more interactive way: use "think-pair-share" : think individually, pair up and code the answer. Then share your thought process with everyone.
size = 1000000
prob = 1/1000000
x = seq(0, 10, by = 1)
plot(dbinom(x = x, size = size, prob = prob) ~ x, type = "h", lwd = 4,
     ylab = "Probability", xlab = "Event")
points(dbinom(x = x[-1], size = size, prob = prob) ~ x[-1], type = "h", lwd = 4, col = "red")
# This is the probability
pbinom(0, size = size, prob = prob, lower.tail = F)
# [1] 0.6321207
# pbinom(0, size = size, prob = prob, lower.tail = T)
It has been said that if "physicists have general relativity, chemists have the periodic table of elements and biologists have evolution or DNA, statisticians have the central limit theorem" (Nathan Uyttendaele).
The Central limit theorem (CLT) stipulates that:
We will use simulations to test this assertion!
This is important: the more identically distributed variables you ADD together, the more their sum will tend towards a normal distribution.
See the explanation in French in this video: Le théorème central limite - best of des aventures d'Albert. This is from Nathan Uyttendaele: "Physicists have general relativity and the elegant Maxwell equations, Chemists have the periodic table of elements and Biologists have evolution or the fascinating DNA molecule and statisticians have the central limit theorem!", from the original "Les physiciens ont la relativité générale et les élégantes équations de Maxwell, les chimistes ont le tableau périodique des éléments et les biologistes ont l'évolution ou encore la fascinante molécule d'ADN et les statisticiens ont le théorème central limite!"
See
Make a histogram of the sum of the values from rolling a set of 1, 2, 10 and 50 dice 10,000 times.
Hint: to roll a dice, use the following code
set.seed(42)
sample(1:6, size = 1, replace = TRUE)
Make a histogram of the sum of the values from rolling a set of 1, 2, 10 and 50 dice 10,000 times.
set.seed(12345)
roll = 1e4 # Nb of times we do the experiment (characterize the underlying r.v.)
for (nb.dice in c(1, 2, 10, 50)) {
  res.exp = replicate(roll, simplify = T,
                      sample(1:6, size = nb.dice, replace = TRUE))
  if (nb.dice == 1) {Xsum = res.exp} else {Xsum = apply(res.exp, 2, sum)} # add independent r.v. (sum of cols)
  hist(Xsum, main = paste("n =", roll, ifelse(nb.dice == 1, "die =", "dice ="), nb.dice))
}
set.seed(12345)
roll = 1e4 # Nb of times we do the experiment (characterize the underlying r.v.)
for (nb.dice in c(1, 2, 10, 500)) {
  res.exp = replicate(roll, simplify = T,
                      sample(1:6, size = nb.dice, replace = TRUE))
  if (nb.dice == 1) {Xsum = res.exp; Xmean = (res.exp)} else {
    Xsum = apply(res.exp, 2, sum) # add independent r.v. (sum of cols)
    Xmean = apply(res.exp, 2, mean)
  }
  hist((Xmean - mean(Xmean))/(sd(Xmean)), xlim = c(-4, 4), ylim = c(0, 1),
       main = paste("n =", roll, ifelse(nb.dice == 1, "die =", "dice ="), nb.dice),
       probability = T)
  curve(dnorm, from = -10, to = 10, add = TRUE)
}
n = 1000 # Number of points
# Generate multiple additions of random variables
for (i in c(2, 50, 1000, 5000)) {
  clt = replicate(i, rexp(n, rate = 1), simplify = FALSE)
  hist(apply(do.call(cbind, clt), 1, sum),
       main = paste("Hist. of", i, "variables"), xlab = "x") # Draw the histogram
}
# Generate multiple additions of random variables
for (i in c(2, 50, 1000, 5000)) {
  clt = replicate(i, runifdisc(n), simplify = FALSE)
  hist(apply(do.call(cbind, clt), 1, sum),
       main = paste("Histogram of", i, "variables"), xlab = "x") # Draw the histogram
}
This is an example of a uniform discrete distribution ranging from 7 to 10.
n = 1e6 # Number of points
# Generate multiple additions of random variables
clt = replicate(n = 20, # roll 20 dice
                runifdisc(n = n, min = 1, max = 6), simplify = FALSE)
sum.rdm.var = apply(do.call(what = cbind, args = clt), 1, sum)
mean.rdmv = mean(sum.rdm.var); sd.rdmv = sd(sum.rdm.var) # mean and sd
hist(sum.rdm.var, main = paste("Histogram of", 20, "variables"), xlab = "x")
curve(expr = dnorm(x, mean = mean.rdmv, sd = sd.rdmv),
      from = mean.rdmv - 5*sd.rdmv, to = mean.rdmv + 5*sd.rdmv, ylab = "Density")
pnorm(100, mean = 70, sd = sd.rdmv, lower.tail = F), which is 4.332619 \times 10^{-5} (0.0000433, or about 0.0043%)!

# If you divide by the number of dice that were rolled, you can see that it actually tends
# towards the mean of rolling 1 die many times: 3.5!
# hist(sum.rdm.var/i, main = paste("Histogram of", i, "variables"), xlab = "x")
Galton's board (quincunx) is a representation of a lot of Bernoulli trials that will form a normal distribution
Below is a simulation of 1000 balls, hitting 500 pegs that were left from the point "0". The normal curve was drawn with a standard deviation of \sqrt{n}-1
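A minimal sketch of that idea (the simulation below is an assumption mirroring the description, not the workshop's own code): each ball's final position is the sum of left/right Bernoulli steps, so by the CLT it is approximately normal.

set.seed(1)
n.balls = 1000; n.pegs = 500
final.pos = replicate(n.balls, sum(sample(c(-1, 1), size = n.pegs, replace = TRUE)))
hist(final.pos, probability = TRUE, main = "Galton board", xlab = "Final position")
curve(dnorm(x, mean = 0, sd = sqrt(n.pegs)), add = TRUE) # sd of a sum of n ±1 steps is sqrt(n)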
set.seed(24601) # setting this so the random results will be repeatable
library(MASS)
covmat <- matrix(c(1.0,  0.2,  0.6, # variance-covariance matrix of the data
                   0.2,  2.0, -0.5, # Positive-definite symmetric matrix
                   0.6, -0.5,  1.0), nrow = 3)
data <- mvrnorm(n = 300,
                mu = c(1, -1, 0), # mean of the data
                Sigma = covmat) # generate random data that match that variance-covariance matrix
plot(data[, 1:2], pch = 19, col = b.5); abline(h = 0, v = 0, lty = 3)
plot(data[, 2:3], pch = 19, col = b.5); abline(h = 0, v = 0, lty = 3)
ggplot2::autoplot(vegan::rda(data)) +
  theme_classic() +
  theme(plot.margin = unit(c(0, 0.5, 0, 0.5), "cm"), legend.position = "none") +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey50", size = 0.5) +
  geom_vline(xintercept = 0, linetype = "dotted", color = "grey50", size = 0.5)
The faux package can help you simulate factorial designs and multiple correlated normal variables (see rnorm_multi).

library(faux); set.seed(12345); n = 500 # number of values to generate for the variable x
traits = rnorm_multi(n = n, # ??rnorm_multi
                     mu = c(11.317531, 10.470744, 9.377967), # Mean of 3 variables
                     sd = c(0.6247866, 0.7134755, 0.5502800), # SD of variables
                     r = c(0.4887644, 0.4678446, 0.7056161), # correlation between variables
                     varnames = c("Beak length", "Beak depth", "Beak width"),
                     empirical = TRUE)
plot(traits[, 1:2], col = b.5, pch = 19, main = "Simulated data", xlab = "")
title(xlab = "Beak length", ylab = "", line = 2.2, cex.lab = 1.2)
cov(traits); cor(traits) # you can verify that these were the input in the function
#             Beak length Beak depth Beak width
# Beak length   0.3903583  0.2178765  0.1608485
# Beak depth    0.2178765  0.5090473  0.2770329
# Beak width    0.1608485  0.2770329  0.3028081
#             Beak length Beak depth Beak width
# Beak length   1.0000000  0.4887644  0.4678446
# Beak depth    0.4887644  1.0000000  0.7056161
# Beak width    0.4678446  0.7056161  1.0000000
\mathbf{\Sigma} = \mathbf{V}^{1/2}\mathbf{\rho}\,\mathbf{V}^{1/2} \quad \Leftrightarrow \quad \mathbf{\rho} = \mathbf{V}^{-1/2}\mathbf{\Sigma}\,\mathbf{V}^{-1/2}
\left(\begin{array}{ccc}\sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}\end{array}\right) = \left(\begin{array}{ccc}\sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}}\end{array}\right) \left(\begin{array}{ccc}\rho_{11} & \rho_{12} & \rho_{13} \\ \rho_{21} & \rho_{22} & \rho_{23} \\ \rho_{31} & \rho_{32} & \rho_{33}\end{array}\right) \left(\begin{array}{ccc}\sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}}\end{array}\right)
\left(\begin{array}{ccc}\rho_{11} & \rho_{12} & \rho_{13} \\ \rho_{21} & \rho_{22} & \rho_{23} \\ \rho_{31} & \rho_{32} & \rho_{33}\end{array}\right)=\left(\begin{array}{ccc}\sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}}\end{array}\right)^{-1}\left(\begin{array}{ccc}\sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}\end{array}\right)\left(\begin{array}{ccc}\sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}}\end{array}\right)^{-1}
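A quick numerical check of this identity, reusing the covariance matrix from the mvrnorm example above; base R's cov2cor() performs the same conversion:

Sigma <- matrix(c(1.0,  0.2,  0.6,
                  0.2,  2.0, -0.5,
                  0.6, -0.5,  1.0), nrow = 3)
V.inv.sqrt <- diag(1 / sqrt(diag(Sigma))) # V^{-1/2}: 1/sd on the diagonal
rho <- V.inv.sqrt %*% Sigma %*% V.inv.sqrt
all.equal(rho, cov2cor(Sigma), check.attributes = FALSE) # TRUE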
set.seed(123) # setting this so the random results will be repeatable
library(MASS)
# Simulating 3 traits for 4 different species
n = 200
Amat1 = MASS::mvrnorm(n, mu = c(11.2, 11.8, 9.91), Sigma = diag(c(1.31, 1.01, 1.02)))
Amat2 = MASS::mvrnorm(n, mu = c(7.16, 8.54, 6.82), Sigma = diag(c(0.445, 0.546, 0.350)))
Amat3 = MASS::mvrnorm(n, mu = c(15.6, 14.6, 13.5), Sigma = diag(c(1.43, 0.885, 0.990)))
Amat4 = MASS::mvrnorm(n, mu = c(8.65, 14.1, 8.24), Sigma = diag(c(0.535, 0.844, 0.426)))
Amat = rbind(Amat1, Amat2, Amat3, Amat4)
Amat.gr = cbind(Amat, gl(4, k = n, labels = c(1, 2, 3, 4)))
# by(Amat.gr[,1:3], INDICES = Amat.gr[,4], FUN = cov) # calc. cov. mat. for all gr
summary(m1 <- prcomp(Amat, scale = T))
# Importance of components:
#                           PC1    PC2    PC3
# Standard deviation     1.5566 0.6977 0.3005
# Proportion of Variance 0.8076 0.1623 0.0301
# Cumulative Proportion  0.8076 0.9699 1.0000
# biplot(m1, xlabs = rep(".", nrow(Amat)), cex = 3)
plot(Amat[,1], Amat[,2], pch = 19, col = gl(4, k = n, labels = c(1, 2, 3, 4)))
abline(h = mean(Amat[,1]), v = mean(Amat[,2]), lty = 3)
plot(vegan::scores(m1), asp = 1, pch = 19, col = gl(4, k = n, labels = c(1, 2, 3, 4)))
abline(h = 0, v = 0, lty = 3)
# library(ggvegan)
# autoplot(vegan::rda(Amat))
x = rnorm(100)
# x = rt(100, 1000)
library(fitdistrplus)
descdist(x, discrete = FALSE)
# summary statistics
# ------
# min: -2.307474   max: 2.735209
# median: -0.06346588
# mean: -0.03805934
# estimated sd: 0.9945627
# estimated skewness: 0.2436371
# estimated kurtosis: 2.934408
fit.norm <- fitdist(x, "norm") # fit a normal distribution to x
plot(fit.norm)
Y = \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{p} x_{p} + \epsilon
Reference for the section
You can add a confidence interval (predict(lm.out, interval = "confidence"); at the mean of x and y, the uncertainty comes only from the intercept) and a prediction interval (predict(lm.out, interval = "prediction")) to the fitted line.

Type | Equation |
---|---|
Linear | Y = \beta_{0} + \beta_{1} x_{1} + \epsilon or \mu = \beta_{0} + \beta_{1} x_{1} for Y \sim Normal(\mu, \sigma) |
Poisson | Y \sim Poisson(\mu) with \text{ln}\,\mu=\beta_0+\beta_1x or \mu=e^{\beta_0+\beta_1x} (no separate error term, as the mean \mu also determines the variance) |
Logistic | Y \sim Binomial(p) with \text{log} \Bigl(\frac{p}{1-p}\Bigr) = \beta_0 +\beta_1x or p=\frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}, where \text{log} \Bigl(\frac{p}{1-p}\Bigr) is the log odds (logit). The Y values are drawn from a Bernoulli distribution (binomial of size = 1) |
set.seed(1234); n = 1000
y1 = rnorm(n, mean = 15, sd = 1)
y2 = rnorm(n, mean = 15.5, sd = 1)
sim.aov1 = data.frame(y = y1, gr = "A")
sim.aov2 = data.frame(y = y2, gr = "B")
df.aov = rbind(sim.aov1, sim.aov2)
df.aov$gr = factor(df.aov$gr)
# t.test(y ~ gr, data = df.aov) or
aov.out = aov(y ~ gr, data = df.aov)
# summary(aov.out)
tk.test = TukeyHSD(aov.out)
round(tk.test$gr, 2)
#     diff  lwr  upr p adj
# B-A 0.54 0.45 0.63     0
plot(y~gr, data = df.aov)
Assumptions? see Wikipedia
set.seed(12345678)
n = 100; beta0 = 2.5; beta1 = 0.8
x.lm = runif(n = n) # or it could be something else! rnorm(n = n, mean = 10, sd = 1)
err = rnorm(n = n, mean = 0, sd = 1)
# Linear combination
mu = beta0 + beta1*x.lm
y.lm = mu + err # same as y.lm = rnorm(n = n, mu, sd = 1)
# Make a dataframe of the data
df.lm = data.frame(x = x.lm, y = y.lm)
lm.out = lm(y ~ x, data = df.lm) # fit the model (completing the snippet)
round(coefficients(summary(lm.out)), 4)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   2.4726     0.1635 15.1185   0.0000
# x             0.8175     0.2779  2.9413   0.0041
# Adjusted R^2
summary(lm.out)$adj.r.squared
# [1] 0.0717415
# summary(lm.out)$fstatistic
# anova(lm.out)
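To eyeball the usual linear-model assumptions (linearity, normal and homoscedastic residuals, influential points) on this fit, base R's built-in diagnostics are a quick check:

par(mfrow = c(2, 2))
plot(lm.out) # residuals vs fitted, QQ plot, scale-location, residuals vs leverage
par(mfrow = c(1, 1))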
You can simulate the response (fitted line) from the distribution used in a fitted model.

sim.lm = simulate(lm.out, nsim = 2000, seed = 12) # simulate based on predicted value and sigma of model
rx = range(c(rowMeans(sim.lm), fitted(lm.out)))
hist(rowMeans(sim.lm), xlim = rx, main = "Hist simulation", xlab = "")
title(xlab = "rowMeans(sim.lm)", line = 2.2, cex.lab = 1.2)
hist(fitted(lm.out), xlim = rx, main = "Hist fitted", xlab = "")
title(xlab = "fitted(lm.out)", line = 2.2, cex.lab = 1.2)
c(mean(rowMeans(sim.lm)), mean(y.lm), mean(fitted(lm.out))) # compare
# [1] 2.892684 2.891622 2.891622
rbind(head(rowMeans(sim.lm)), head(fitted(lm.out))) # Average of all simulations (in row) gives the fitted values
#             1        2        3        4        5        6
# [1,] 3.182545 2.677072 3.248524 3.210164 2.470363 2.895410
# [2,] 3.187451 2.708652 3.246897 3.195649 2.479667 2.899208
simulate() allows you to "Simulate one or more responses (Y) from the distribution corresponding to a fitted model object" (see stats:::simulate.lm). The loop below mimics what simulate() does for an lm fit:

nsim = 2
head(simulate(lm.out, nsim = nsim, seed = 1))
set.seed(1)
for (i in 1:nsim) {
  tmp <- fitted(lm.out) + rnorm(length(predict(lm.out)), 0, summary(lm.out)$sigma)
  res <- if (i == 1) tmp else cbind(res, tmp)
}
head(res)
You can replicate the simulation a certain number of times. This is where simulations shine: they really help us understand how to interpret some results, or give us a better feeling for the confidence (or uncertainty) we can have in our data. Or at least, this is one use of them.
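A minimal, self-contained sketch of the replicate() pattern (the function and values are illustrative): re-run a simulation many times and keep one statistic per run.

one.run <- function(n = 30) {
  x <- rnorm(n) # simulate a data set
  mean(x)       # the statistic we keep from each run
}
means <- replicate(1000, one.run())
hist(means, main = "Sampling distribution of the mean", xlab = "mean(x)")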
I'd suggest opening RStudio to show the code
# Coefficients table:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   5.4483     0.9409  5.7905        0
# x1            0.5419     0.0922  5.8803        0
# x2            0.4796     0.0157 30.5133        0
# R Squared:
# [1] 0.9083729
n = 100
x1 = rnorm(n = n, mean = 10, sd = 1)
x2 = rnorm(n = n, mean = 15, sd = 6)
beta0 = 2.5
beta1 = .8
beta2 = .5
err = rnorm(n = n, mean = 0, sd = 1)
# Completing the snippet: combine into the response and fit the model
y = beta0 + beta1*x1 + beta2*x2 + err
summary(lm(y ~ x1 + x2))
set.seed(12345678)
n = 100; meanx = 10
x.lm = rnorm(n = n, mean = meanx, sd = 1)
cat.lm = c("Control", "Treatment")
gr = gl(n = length(cat.lm), k = n/2, labels = cat.lm) # makes factors (default)
# Make a dummy matrix
dummy.mat = matrix(data = 0, nrow = n, ncol = length(cat.lm))
colnames(dummy.mat) <- paste0(c(cat.lm))
dummy.mat[, 1] <- 1 # Intercept
dummy.mat[, 2] <- ifelse(gr == "Treatment", 1, 0)
beta1 = 0.8; betaControl = .2; betaTreatment = .7
# Combine the data to get the linear prediction
lin.pred.x = dummy.mat %*% c(betaControl, betaTreatment) + beta1 * x.lm
y.lm = lin.pred.x + rnorm(n = n, mean = 0, sd = 1)
df.lm = data.frame(x = x.lm, y = y.lm, gr = gr)
b.5 = scales::alpha("black", alpha = .5); r.5 = scales::alpha("red", alpha = .5)
lm.out = lm(y ~ x + gr, data = df.lm); lmcoef <- lm.out$coefficients
# Here you can see beta1 (x), betaControl (Intercept) and betaTreatment (grTreatment)
summary(lm(y ~ x + gr, data = df.lm))
# 
# Call:
# lm(formula = y ~ x + gr, data = df.lm)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.34958 -0.54678  0.08423  0.50000  2.32288 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -0.11840    1.00510  -0.118 0.906470    
# x            0.84485    0.09956   8.486 2.47e-13 ***
# grTreatment  0.73082    0.18612   3.927 0.000161 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.9264 on 97 degrees of freedom
# Multiple R-squared:  0.4577, Adjusted R-squared:  0.4465 
# F-statistic: 40.93 on 2 and 97 DF,  p-value: 1.291e-13
# Get JUST the DIFFERENCE with the treatment (Intercept is the Control)
# gr.means = lm(y ~ gr, data = df.lm)
# coefficients(gr.means)
plot(y ~ x, data = df.lm, pch = 19, col = ifelse(gr == "Treatment", r.5, b.5))
abline(a = lmcoef["(Intercept)"], b = lmcoef["x"], col = "black") # control
abline(a = lmcoef["(Intercept)"] + lmcoef["grTreatment"], b = lmcoef["x"], col = "red")
boxplot(y ~ gr, data = df.lm, pch = 19, col = ifelse(gr == "Treatment", r.5, b.5))
round(coefficients(summary(lm.out)), 4) # betaControl is (Intercept)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -0.1184     1.0051 -0.1178   0.9065
# x             0.8449     0.0996  8.4855   0.0000
# grTreatment   0.7308     0.1861  3.9267   0.0002
summary(lm.out)$adj.r.squared # betaTreatment is grTreatment
# [1] 0.446509
#               Expected Observed            Notes
# x             10       9.920725            Mean of X
# y             8.2, 8.9 8.337635, 8.919466  Mean of each category in Y
# m.beta1       8        8.381541            Global intercept (mean)
# m.beta1.bC    8.2      8.26314             Global + control
# m.beta1.bC.bT 8.9      8.993961            Global + control + treatment
# beta1         0.8      0.8448517
# betaC         0.2      -0.1184008
# betaT         0.7      0.7308215
#             Estimate Std. Error  t value Pr(>|t|)
# (Intercept)   2.4321     0.2085  11.6661        0
# x1            1.5561     0.3094   5.0289        0
# x2            2.1088     0.1590  13.2626        0
# x1:x2        -3.0863     0.2276 -13.5584        0
# [1] 0.7190843
n = 100
x.lm1 = rbinom(n = n, size = 1, prob = 0.5)
x.lm2 = rnorm(n = n, mean = 0, sd = 1)
x.lm2 = scale(x.lm2)
beta0 = 2.5
beta1 = 1.5
beta2 = 2
beta3 = -3
err.lm = rnorm(n = n, mean = 0, sd = 1)
# Completing the snippet: linear predictor with interaction, dataframe, and fit
y.lm = beta0 + beta1*x.lm1 + beta2*x.lm2 + beta3*x.lm1*x.lm2 + err.lm
df.lm = data.frame(y = y.lm, x1 = x.lm1, x2 = x.lm2)
lm.out = lm(y ~ x1*x2, data = df.lm)
plot(x = df.lm[df.lm$x1 == 0, ]$x2, y = df.lm[df.lm$x1 == 0, ]$y, # Add x1 = 0
     col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25), pch = 19,
     xlab = "x2", ylab = "y", ylim = range(df.lm$y))
abline(h = 0, v = 0, lty = 3)
abline(a = coef(lm.out)[1], b = coef(lm.out)[3], col = "blue", lwd = 2)
points(x = df.lm[df.lm$x1 == 1, ]$x2, y = df.lm[df.lm$x1 == 1, ]$y, # Add x1 = 1
       col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
abline(a = coef(lm.out)[1] + coef(lm.out)[2],
       b = coef(lm.out)[3] + coef(lm.out)[4], col = "red", lwd = 2)
See Tutorial 11: Interaction terms, by Johannes Karreth
# If x1 = 1
# $\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2$
# $\hat{y}=\beta_0+\beta_1 (1)+\beta_2x_2+\beta_3(1)x_2$
# $\hat{y}=(\beta_0+\beta_1)+x_2(\beta_2+\beta_3)$
hist(df.lm$x1);hist(df.lm$x2);hist(df.lm$y)
lm.out = lm(y ~ x1*x2, data = df.lm); lmcoef = coef(lm.out)
# Make a new range of x2 values on which we will test the effect of x1
x2r = range(x.lm2); x2.sim = seq(x2r[1], x2r[2], by = .5); x2s.lg = length(x2.sim)
# This is the effect of x1 at different values of x2 (moderates the effect of x1)
eff.x1 <- lmcoef["x1"] + lmcoef["x1:x2"] * x2.sim # gets slopes
eff.x1.int <- lmcoef["(Intercept)"] + lmcoef["x2"] * x2.sim # gets int.
eff.dat <- data.frame(x2.sim, eff.x1, eff.x1.int) # Put that in a dataframe
virPal <- viridis::viridis(length(x2.sim), alpha = .8) # Get colours
eff.dat$col <- virPal[as.numeric(cut(eff.dat$x2.sim, breaks = x2s.lg))]
df.lm$col <- virPal[as.numeric(cut(df.lm$x2, breaks = x2s.lg))]
df.lm$line <- c("black", "red")[as.numeric(cut(df.lm$x2, breaks = x2s.lg)) %% 2 + 1]
# summary(lm.out)
# coef(lm.out)
# To show the plot of an interaction between 2 continuous variables, we need to use
# the marginal effects or the conditional effect (effect of x1 conditional on the values of x2)
# The marginal effect of x_1 is $\beta_1+\beta_3\times x_2$
# The marginal effect of x_2 is $\beta_2+\beta_3\times x_1$
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1))
plot(x = df.lm$x1, y = df.lm$y, bg = df.lm$col, pch = 21, xlab = "x1", ylab = "y")
apply(eff.dat, 1, function(x) abline(a = x[3], b = x[2], col = x[4], lwd = 2)) # NULL
abline(h = 0, v = 0, lty = 3)
legend("topleft", title = "x2", legend = round(eff.dat$x2.sim, 1),
       lty = 1, lwd = 3, col = eff.dat$col, bg = scales::alpha("white", .5))
summary(lm.out)
# 
# Call:
# lm(formula = y ~ x1 * x2, data = df.lm)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.92554 -0.43139  0.00249  0.65651  2.60188 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.3583     1.4868   0.241     0.81    
# x1            1.9458     0.2875   6.769 1.03e-09 ***
# x2            3.1702     0.7657   4.140 7.46e-05 ***
# x1:x2         2.7603     0.1485  18.591  < 2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.035 on 96 degrees of freedom
# Multiple R-squared:  0.9967, Adjusted R-squared:  0.9966 
# F-statistic:  9696 on 3 and 96 DF,  p-value: < 2.2e-16
See the example from ?lm: Annette Dobson (1990), An Introduction to Generalized Linear Models, page 9: Plant Weight Data.
set.seed(1234)
n = 10; sd.e = 0.3
control = rnorm(n, mean = 5.03, sd = 0.28)
treatmt = rnorm(n, mean = 4.66, sd = 0.49)
gr <- gl(n = 2, k = n, length = n*2, labels = c("ctl", "trt"))
weight = c(control, treatmt) + rnorm(2*n, mean = 0, sd = sd.e)
plant.weight = data.frame(weight = weight, gr = gr)
lm.out.int <- lm(weight ~ gr, data = plant.weight)
lm.out.noi <- lm(weight ~ gr - 1, data = plant.weight) # Comparing the data to 0 (int. = 0, which becomes the reference)
anova(lm.out.int) # this time, it was significant
# Analysis of Variance Table
# 
# Response: weight
#           Df Sum Sq Mean Sq F value  Pr(>F)  
# gr         1 0.9422 0.94219  4.8569 0.04079 *
# Residuals 18 3.4918 0.19399                  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# summary(lm.out.noi) # Just to see the estimate (NOT THE P-VALUES)
s.lm.out = summary(lm.out.int)
# You can get the p-value with this function
pt(q = abs(s.lm.out$coefficients["grtrt", "t value"]),
   df = s.lm.out$df[2], lower.tail = F) * 2
# [1] 0.04079444
plot(weight~gr, data=plant.weight, las=1)
plot(lm.out.int, las = 1)
# Make the function for the simulation
exp.plant <- function(n = 100, # number of seeds in EACH group (2*n = total)
                      sd.e = 0.3, plot = F, ret = T) {
  control = rnorm(n, mean = 5.03, sd = 0.28)
  treatmt = rnorm(n, mean = 4.66, sd = 0.49)
  gr <- gl(n = 2, k = n, length = 2*n, labels = c("ctl", "trt"))
  weight = c(control, treatmt) + rnorm(2*n, mean = 0, sd = sd.e)
  plant.weight = data.frame(weight = weight, gr = gr)
  if (plot) {
    plot(weight ~ gr, data = plant.weight, las = 1)
  }
  lm.out.int <- lm(weight ~ gr, data = plant.weight)
  s.lm.out = summary(lm.out.int)
  if (ret) {
    return(s.lm.out)
  }
}
set.seed(2345); nb.rep = 2000 # number of replications in the simulation
l.rp = replicate(n = nb.rep, simplify = FALSE,
                 expr = exp.plant(n = 10)$coefficients["grtrt", "t value"])
p.val.lm = pt(q = abs(unlist(l.rp)), df = s.lm.out$df[2], lower.tail = F) * 2 # Get p-values
exp.plant(n = 10, plot = T, ret = F) # plot a simulation example
hist(unlist(l.rp), main = "t-values", probability = T, xlab = "")
lines(density(unlist(l.rp)), col = "blue", lwd = 2)
abline(v = qt(0.025, df = s.lm.out$df[2]))
hist(p.val.lm, main = "p-values", probability = T, xlab = "", xlim = c(0, 1))
lines(density(p.val.lm), col = "red", lwd = 2); abline(v = 0.05)
qt(0.05, df = s.lm.out$df[2])
# [1] -1.734064
sum(unlist(l.rp) < qt(0.025, df = s.lm.out$df[2])) / length(unlist(l.rp))
# [1] 0.3145
sum(p.val.lm <= 0.05) / length(p.val.lm)
# [1] 0.3145
set.seed(987654)
n = 100
x1 = rnorm(n = n, mean = 6, sd = 1)
x2 = rnorm(n = n, mean = 0, sd = 1)
# Rescale the data
x1z = scale(x1)
x2z = scale(x2)
z = 0 + 2*x1z + 3*x2 # This gives the LOG odds. Recall that $p = odds/(1+odds)$
pr = 1/(1+exp(-z)) # inverse-logit function: transform log odds into probabilities
y = rbinom(n = n, size = 1, prob = pr) # Bernoulli response variable
# Combine the data in a dataframe
df = data.frame(y = y, x1 = x1, x2 = x2)
# Compute the model
glm.logist = glm(y ~ x1 + x2, data = df, family = "binomial")
pr2 = boot::inv.logit(z)
# source: https://sebastiansauer.github.io/convert_logit2prob/
logit2prob <- function(logit) {
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
coef(glm.logist)
# (Intercept)          x1          x2 
#  -11.503498    1.987659    3.061898
round(logit2prob(coef(glm.logist)), 5)
# (Intercept)          x1          x2 
#     0.00001     0.87950     0.95529
plot(y ~ x1, data = df, col = scales::alpha("black", .5), pch = 19)
newdata <- data.frame(x1 = seq(min(x1), max(x1), len = n),
                      x2 = seq(min(x2), max(x2), len = n))
newdata$y = predict(object = glm.logist, newdata = newdata, type = "response")
lines(x = newdata$x1, y = newdata$y, col = "red", lwd = 2)
# (Intercept)          x1          x2 
#  -11.503498    1.987659    3.061898
For a GLM, the test statistic of the estimate is a z value (from a z-distribution) rather than a t value.

#              Estimate Std. Error   z value     Pr(>|z|)
# (Intercept) -4.752291  0.7031429 -6.758641 1.392917e-11
# x1           0.835486  0.1179451  7.083686 1.403694e-12
# Probability at intercept is 0.008558029
# Probability for an increase of 1 sd of X is 0.6975137
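Those two probabilities follow from the coefficients shown above via the inverse logit; base R's plogis() computes it directly:

plogis(-4.752291) # probability at the intercept: 0.008558029
plogis(0.835486)  # inverse logit of the slope coefficient: 0.6975137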
x1 = rnorm(n = n, mean = 6, sd = 1)
x1z = scale(x1)
z = 0 + 2*x1z
pr = 1/(1+exp(-z))
y = rbinom(n = n, size = 1, prob = pr)
#               Estimate Std. Error   z value     Pr(>|z|)
# (Intercept)  1.1567584  0.2365723  4.889662 1.010094e-06
# x1           0.2510816  0.1780770  1.409961 1.585512e-01
# I(x1^2)     -0.3458782  0.1378428 -2.509222 1.209974e-02
# [1] 1
# Probability at intercept is 0.3339322
mean(df$x) + 0*sd(df$x)
# [1] 5.865516
1 - 1/(1+exp(mod.test(b0, b1, b2, mean(df$x) + 0*sd(df$x))))
# [1] 0.7607432
mean(df$x) + 1*sd(df$x)
# [1] 8.833369
1 - 1/(1+exp(mod.test(b0, b1, b2, mean(df$x) + 1*sd(df$x))))
# [1] 0.7430653
mean(df$x) + 2*sd(df$x)
# [1] 11.80122
1 - 1/(1+exp(mod.test(b0, b1, b2, mean(df$x) + 2*sd(df$x))))
# [1] 0.5684202
mean(df$x) + 5*sd(df$x)
# [1] 20.70478
1 - 1/(1+exp(mod.test(b0, b1, b2, mean(df$x) + 5*sd(df$x))))
# [1] 0.001956208
set.seed(6)
n = 150
x1 = rnorm(n = n, mean = 6, sd = 3)
x1z = scale(x1)
B0 = 1
B1 = .4
B2 = -.5
z = B0 + B1*x1z + B2*x1z^2
pr = 1/(1+exp(-z))
y = rbinom(n = n, size = 1, prob = pr)
df = data.frame(y = y, x1 = x1z, x = x1)
glm.logist = glm(y ~ x1 + I(x1^2), data = df, family = "binomial")
n <- 10000
categories = c("A", "B", "C")
beta0 <- 2
betaB <- 3.1
betaC <- -4.35
beta1 <- 2
xxx = rnorm(n = n, mean = 0, sd = 1)
# Completing the snippet: build the dummy matrix for the categories
x = factor(sample(categories, size = n, replace = TRUE))
dummy.mat = model.matrix(~ x - 1) # columns xA, xB, xC
linpred <- dummy.mat[, "xA"]*beta0 + dummy.mat[, "xB"]*betaB +
  dummy.mat[, "xC"]*betaC + beta1*xxx
prob_i <- exp(linpred) / (1 + exp(linpred))
y <- rbinom(n = n, size = 1, prob = prob_i)
set.seed(42)
n = 500
x = rnorm(n = n, mean = 0, sd = 1)
# Rescale the data
xz = scale(x)
log.mu = 1 + .8*xz
y = rpois(n = n, lambda = exp(log.mu))
# Combine the data in a dataframe
df = data.frame(y = y, x = x)
# now feed it to glm:
glm.poisson = glm(y ~ x, data = df, family = "poisson")
plot(y ~ x, data = df, col = scales::alpha("black", 0.5), pch = 19, cex = 0.5,
     main = "Poisson regression")
newdata <- data.frame(x = seq(min(x), max(x), len = n))
newdata$y = predict(object = glm.poisson, newdata = newdata, type = "response")
lines(x = newdata$x, y = newdata$y, col = "red", lwd = 2)
glm.gau = glm(y ~ x, data = df, family = "gaussian")
hist(residuals(glm.gau), main = "Residuals Gaussian")
hist(residuals(glm.poisson), main = "Residuals Poisson")
Y = \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{p} x_{p} + \epsilon
This only means that \mathbf{Y} is a vector (or matrix) made up of the linear combination of fixed effects \mathbf{X}\beta and a random part \epsilon.
In this notation, an LMM (with a random intercept) is: \mathbf{Y} = \mathbf{X}\beta + \mathbf{Z\upsilon} + \epsilon
All the previous elements stay the same, except that the random part is now composed of \mathbf{Z \upsilon} + \epsilon, with one \mathbf{\upsilon} per group. \mathbf{X} is the (fixed-effects) design matrix and \mathbf{Z} is the random-effects block matrix.
y_{ij} = \beta_{0} + \beta_{1} x_{1ij} + \upsilon_{0j} + \upsilon_{1j} x_{1ij} + \epsilon_{ij}
One implementation of that could be (\beta_1 and \beta_2 represent different treatments; v_1 and v_2 are different groups):
\left[\begin{array}{l} y_{11} \\ y_{21} \\ y_{12} \\ y_{22} \end{array} \right] = \left[ \begin{array}{ll} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{array}\right]\left[\begin{array}{l} \beta_{1} \\ \beta_{2} \end{array}\right]+\left[\begin{array}{ll} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{array}\right]\left[\begin{array}{l} v_{1} \\ v_{2} \end{array}\right]+\left[\begin{array}{l} \epsilon_{11} \\ \epsilon_{21} \\ \epsilon_{12} \\ \epsilon_{22} \end{array}\right]
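To see \mathbf{X} and \mathbf{Z} concretely, here is a minimal sketch (with illustrative data) extracting both matrices from a random-intercept model fitted with lme4:

library(lme4)
set.seed(1)
d <- data.frame(x = rnorm(40), gr = gl(4, 10, labels = paste0("g", 1:4)))
d$y <- 1 + 0.5*d$x + rep(rnorm(4, 0, 1), each = 10) + rnorm(40) # group effect + noise
m <- lmer(y ~ x + (1 | gr), data = d)
getME(m, "X") # fixed-effects design matrix X
getME(m, "Z") # random-effects block matrix Z (one column per group)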
set.seed(16)
# Experimental design
lakes = 6
f.sp = 3
n.id = 10
# Setting parameters
sd = 0.3 # Individual measure
sd.lake = .05
sd.fish = .02
beta0 = 1
beta1 = 0.003
total.n = lakes*f.sp*n.id
# Getting the age status of a fish
age = rep(c(rep("Juv", 2), rep("Ado", 6), rep("Adt", 2)), lakes)
n.juv = length(which(age == "Juv"))
n.ado = length(which(age == "Ado"))
n.adt = length(which(age == "Adt"))
age.df = data.frame(age, length = NA)
# Generating the length of the fish depending on its age
age.df[age.df$age == "Juv", "length"] <- rnorm(n.juv, mean = 100, sd = 10)
age.df[age.df$age == "Ado", "length"] <- rnorm(n.ado, mean = 250, sd = 50)
age.df[age.df$age == "Adt", "length"] <- rnorm(n.adt, mean = 400, sd = 10)
# Trophic position is the response variable
# Fish length is the phenotype that is measured
lake.rep = gl(n = lakes, k = f.sp*n.id, labels = paste0("L", 1:lakes))
# length(lake.rep)
sp.rep = rep(x = gl(n = f.sp, k = n.id, labels = paste0("s", 1:f.sp)), lakes)
# length(sp.rep)
f.id = paste0("f.", 1:(lakes*f.sp*n.id))
# Random effects
# Setting it up with rnorm (the correct way)
# lake.rdm.eff = rep(rnorm(lakes, 0, sd.lake), each = f.sp*n.id)
# Setting it up manually to see what happens when you modify the value
my.rdm.lake = c(1, 2, 1.1, 0.8, 1.5, 1.8)
lake.rdm.eff = rep(my.rdm.lake, each = f.sp*n.id)
# Setting it up with rnorm (the correct way)
# fish.rdm.eff = rep(rnorm(f.sp, 0, sd.fish), each = n.id)
# Setting it up manually to see what happens when you modify the value
my.rdm.fish = c(-0.5, .4, -0.2)
fish.rdm.eff = rep(my.rdm.fish, each = n.id)
# Individual error
id.err = rnorm(lakes*f.sp*n.id, 0, sd)
f.dat = data.frame(lake = lake.rep, lake.rdm.eff, species = sp.rep,
                   fish.rdm.eff, id = f.id, age.df)
f.dat$t.lvl = with(f.dat, beta0 + beta1*length + lake.rdm.eff + fish.rdm.eff + id.err)
f.dat$z_lgt = scale(f.dat$length)  # (f.dat$length - mean(f.dat$length)) / sd(f.dat$length)
f.dat$z.t.lvl = scale(f.dat$t.lvl)
set.seed(16)
nstand = 5
nplot = 4
mu = 10
sds = 2
sd = 1
stand = rep(LETTERS[1:nstand], each = nplot)
plot = letters[1:(nstand*nplot)]
standeff = rnorm(nstand, 0, sds)
standeff = rep(standeff, each = nplot)
ploteff = rnorm(nstand*nplot, 0, sd)
dat = data.frame(stand, standeff, plot, ploteff)
dat$resp = with(dat, mu + standeff + ploteff)
library(lme4, warn.conflicts = FALSE)
fit1 = lmer(resp ~ 1 + (1|stand), data = dat)
fit1
# Linear mixed model fit by REML ['lmerModLmerTest']
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 72.5943
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 2.168   
#  Residual             1.130   
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.57

twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 2, sigma = 1) {
  standeff = rep(rnorm(nstand, 0, sigma_s), each = nplot)
  stand = rep(LETTERS[1:nstand], each = nplot)
  ploteff = rnorm(nstand*nplot, 0, sigma)
  resp = mu + standeff + ploteff
  dat = data.frame(stand, resp)
  lmer(resp ~ 1 + (1|stand), data = dat)
}
set.seed(16)
twolevel_fun()
# Linear mixed model fit by REML ['lmerModLmerTest']
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 72.5943
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 2.168   
#  Residual             1.130   
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.57

sims = replicate(100, twolevel_fun(), simplify = FALSE)
sims[[100]]
# Linear mixed model fit by REML ['lmerModLmerTest']
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 58.0201
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 1.7197  
#  Residual             0.7418  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       7.711

library(broom.mixed)
tidy(fit1)
# # A tibble: 3 × 8
#   effect   group    term            estimate std.error statistic    df  p.value
#   <chr>    <chr>    <chr>              <dbl>     <dbl>     <dbl> <dbl>    <dbl>
# 1 fixed    <NA>     (Intercept)        10.6       1.00      10.5  4.00 0.000457
# 2 ran_pars stand    sd__(Intercept)     2.17     NA         NA   NA   NA       
# 3 ran_pars Residual sd__Observation     1.13     NA         NA   NA   NA       
tidy(fit1, effects = "fixed")
# # A tibble: 1 × 7
#   effect term        estimate std.error statistic    df  p.value
#   <chr>  <chr>          <dbl>     <dbl>     <dbl> <dbl>    <dbl>
# 1 fixed  (Intercept)     10.6      1.00      10.5  4.00 0.000457
tidy(fit1, effects = "ran_pars", scales = "vcov")
# # A tibble: 2 × 4
#   effect   group    term             estimate
#   <chr>    <chr>    <chr>               <dbl>
# 1 ran_pars stand    var__(Intercept)     4.70
# 2 ran_pars Residual var__Observation     1.28

library(purrr) # v. 0.3.4
suppressPackageStartupMessages(library(dplyr)) # v. 1.0.7
library(ggplot2) # v. 3.3.5
stand_sims = c(5, 20, 100) %>%
  set_names() %>%
  map(~replicate(100, twolevel_fun(nstand = .x)))
stand_vars = stand_sims %>%
  modify_depth(2, ~tidy(.x, effects = "ran_pars", scales = "vcov")) %>%
  map_dfr(bind_rows, .id = "stand_num") %>%
  filter(group == "stand")
head(stand_vars)
# # A tibble: 6 × 5
#   stand_num effect   group term             estimate
#   <chr>     <chr>    <chr> <chr>               <dbl>
# 1 5         ran_pars stand var__(Intercept)     2.53
# 2 5         ran_pars stand var__(Intercept)     7.70
# 3 5         ran_pars stand var__(Intercept)     4.19
# 4 5         ran_pars stand var__(Intercept)     3.28
# 5 5         ran_pars stand var__(Intercept)    12.9 
# 6 5         ran_pars stand var__(Intercept)     4.62
ggplot(stand_vars, aes(x = estimate)) +
  geom_density(fill = "blue", alpha = .25) +
  facet_wrap(~stand_num) +
  geom_vline(xintercept = 4)
stand_vars = mutate(stand_vars, stand_num = forcats::fct_inorder(stand_num))
add_prefix = function(string) {
  paste("Number stands:", string, sep = " ")
}
groupmed = stand_vars %>%
  group_by(stand_num) %>%
  summarise(mvar = median(estimate))
ggplot(stand_vars, aes(x = estimate)) +
  geom_density(fill = "blue", alpha = .25) +
  facet_wrap(~stand_num, labeller = as_labeller(add_prefix)) +
  geom_vline(aes(xintercept = 4, linetype = "True variance"), size = .5) +
  geom_vline(data = groupmed, aes(xintercept = mvar, linetype = "Median variance"), size = .5) +
  theme_bw(base_size = 14) +
  scale_linetype_manual(name = "", values = c(2, 1)) +
  theme(legend.position = "bottom", legend.key.width = unit(.1, "cm")) +
  labs(x = "Estimated Variance", y = NULL)
stand_vars %>%
  group_by(stand_num) %>%
  summarise_at("estimate", list(min = min, mean = mean, med = median, max = max))
# # A tibble: 3 × 5
#   stand_num   min  mean   med   max
#   <fct>     <dbl> <dbl> <dbl> <dbl>
# 1 5          0     3.57  2.95 12.9 
# 2 20         2.05  4.18  4.05  9.37
# 3 100        1.26  4.08  4.09  7.91
stand_vars %>%
  group_by(stand_num) %>%
  summarise(mean(estimate < 4))
# # A tibble: 3 × 2
#   stand_num `mean(estimate < 4)`
#   <fct>                    <dbl>
# 1 5                         0.64
# 2 20                        0.49
# 3 100                       0.47
For now, please see this link: Chapter 4 Simulating Mixed Effects by Lisa DeBruine.
library(lmerTest, warn.conflicts = F)
n = 20; sd.n = 2
# Generate dataframe
x = 1:n
values = rnorm(n = n, mean = 0, sd = sd.n)
gr = rep(c("short", "tall"), each = n/2)
sim.df = data.frame(x, values, gr)
plot(density(sim.df[sim.df$gr %in% "short", "values"]),
     col = "black", ylim = c(0, 1), main = "Density")
lines(density(sim.df[sim.df$gr %in% "tall", "values"]), col = "red")
legend("topright", legend = c("Short", "Tall"), col = c("black", "red"), lty = 1)
plot(0:100, 0:100, type = "n", xlab = "", ylab = "", main = "Random walk", asp = 1)
x <- y <- 50
points(50, 50, pch = 16, col = "red", cex = 1.5)
set.seed(1)
for (i in 1:1000) {
  xi <- sample(c(1, 0, -1), 1); yi <- sample(c(1, 0, -1), 1)
  lines(c(x, x + xi), c(y, y + yi), col = scales::alpha("blue", alpha = .7))
  x <- x + xi; y <- y + yi
  if (x > 100 | x < 0 | y > 100 | y < 0) break
}
points(x, y, pch = 16, col = "green", cex = 1.5)
See st_sample(). In the Examples section of its help page, there are some neat simulations of points in polygons.

library(sf); library(ggplot2)
polygon = list(matrix(c(2, 2, 3, 3, 2.5, 4, 3, 4, 2.5, 5, 1, 4, 0, 5, 1, 3, 2, 2),
                      ncol = 2, byrow = TRUE))
polygon = sf::st_polygon(polygon) # Create an sf polygon
points = sf::st_sample(polygon, size = 50) # Sample 50 rdm pts in the polygon
# Plot using the ggplot geom_sf function
pol.sf = ggplot() +
  geom_sf(aes(), data = polygon) +
  geom_sf(aes(), col = alpha("black", .4), data = points) +
  theme_classic()
# Get random points
set.seed(456)
rdm.n = 5
rdm.pt = st_sample(x = only.park, size = rdm.n)
map.rdm = mapView(only.park, col.regions = c("green")) + mapView(rdm.pt) # Random points
# map.rdm
## save the output
# mapshot(map.rdm, file = "images/map_rdm.png",
#         url = "images/map_rdm.html")
# Add a buffer around the point we want to look at
n = 10*2
min.buff = units::set_units(x = 0, value = "m")
max.buff = units::set_units(x = 100, value = "m")
buffer = st_buffer(x = c(selected.point, selected.point.no.tree),
                   dist = units::set_units(x = 100, value = "m"))
set.seed(6543)
n = 10*2
min.buff = units::set_units(x = 0, value = "m")
max.buff = units::set_units(x = 100, value = "m")
# Get random distances
# rdm.disttmp = rexp(n = n, rate = 1)*20
rdm.disttmp = runif(n = n, min = 0, max = max.buff)
# Get random angles
rdm.angtmp = runif(n = n, min = 0, max = 360)
# Conversion between degrees and radians
rad = rdm.angtmp * pi/180
rdm.ppt_x = rdm.disttmp*cos(rad) + c(st_coordinates(selected.point)[1],
                                     st_coordinates(selected.point.no.tree)[1])
rdm.ppt_y = rdm.disttmp*sin(rad) + c(st_coordinates(selected.point)[2],
                                     st_coordinates(selected.point.no.tree)[2])
rmd.ptdf = data.frame(rdm.ppt_x, rdm.ppt_y, length(rdm.ppt_x))
rmd.ptdf.sf = sf::st_as_sf(rmd.ptdf, coords = c("rdm.ppt_x", "rdm.ppt_y"),
                           crs = 32188) # 4326
set.seed(456)
# Add random points that occupy the space of the polygon (grid)
rdm.pt = st_sample(x = only.park, size = 100, type = "hexagonal")
map.grid = mapView(only.park, col.regions = c("green")) + mapView(rdm.pt)
# map.grid
## save html to png
# mapshot(map.grid, file = "images/map_grid.png", url = "images/map_grid.html")
Now that we have a better understanding of distributions and how to play with them, we are going to look into the subject of power analysis.
Null hypothesis is... | TRUE | FALSE |
---|---|---|
Rejected | Type I error (FP) \alpha | Correct decision (true positive, TP) 1-\beta |
Not rejected | Correct decision (true negative, TN) 1-\alpha | Type II error (FN) \beta |
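Before turning to simulation, note that base R can answer simple power questions analytically; a small sketch with power.t.test (the 0.5-SD effect size is illustrative):

# Sample size per group to detect a 0.5-SD difference with alpha = 0.05 and power = 0.80
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)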
#   cod.id cod.11 cod.21 cod.51 cod.12 cod.22 cod.52
# 1   0.08      1      2   4.93   0.58   1.33   3.14
sim.ci.t <- function(n = 30) {
  gr1 = rnorm(n = n, mean = 0, sd = 2)
  gr2 = rnorm(n = n, mean = 1, sd = 2)
  ttr = t.test(gr1, gr2, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
  return(ttr$p.value)
}
set.seed(1)
get.p.vals = replicate(n = 1000, expr = sim.ci.t(), simplify = T)
alpha.level = 0.05
mean(get.p.vals < alpha.level) # this is the power: the proportion of p-values smaller than the alpha level
# [1] 0.471
# table(get.p.vals < alpha.level)
# prop.table(table(get.p.vals < alpha.level))
# You can see that mean(get.p.vals < alpha.level) is much lower than the power typically desired (e.g., 80%)...
The pwr package explains how to perform power analysis. Here, base R's power.prop.test is used for a proportion:

p1 = .75 # Proportion to test
p2 = .50 # proportion of the null hypothesis
alpha = 0.05 # Type I error
pwr = 0.80 # power, or 1 - beta; beta is the Type II error
coin.p.power = power.prop.test(p1 = p1, p2 = p2, sig.level = alpha, power = pwr)
n = ceiling(coin.p.power$n) # get the sample size from the power analysis
coin.pip = rbinom(n, size = 1, prob = p1) # Generate coin tosses
p.table = table(coin.pip)[c(2, 1)] # Get the number of 1s and 0s
(ptest = prop.test(p.table, alternative = "greater")) # Do the test
# 
#  1-sample proportions test with continuity correction
# 
# data:  p.table, null probability 0.5
# X-squared = 7.6034, df = 1, p-value = 0.002913
# alternative hypothesis: true p is greater than 0.5
# 95 percent confidence interval:
#  0.5742416 1.0000000
# sample estimates:
#         p 
# 0.6896552
# coin.pip = rbinom(1, size = n, prob = p1) # similar to above
# prop.test(x = coin.pip, n = n, alternative = "greater")
curve(dchisq(x, df = ptest$parameter), xlim = c(0, ceiling(max(ptest$statistic))))
abline(v = ptest$statistic, lty = 3)
library(pwr)
r2 = seq(0, 0.9, by = .1)
f2 <- function(r2) {r2/(1 - r2)} # Cohen's f^2 effect size: f2 = R^2/(1 - R^2)
plot(f2(r2) ~ r2)
curve(expr = (x/(1-x)), add = T)
nb.coef.in.model = 2
pwr.lm = pwr.f2.test(u = nb.coef.in.model, f2 = f2(.3), sig.level = 0.001, power = 0.8)
# sample size (n = v + u + 1)
n = ceiling(pwr.lm$v + nb.coef.in.model + 1)
n
# [1] 53
See also the pwr package, the pwr2ppl package, and the simr package; simr uses simulations to perform power analysis for Generalised Linear Mixed Models (GLMMs) fitted with the lme4 package.
Using for(){} loops, we are going to simulate data, get a desired result, and store that result for future analysis. ATTENTION: this is an iterative process. Usually, we make small increments while writing the code, and then put it all together in a cohesive function that does exactly what we are looking for.
# Defining the population
n = 600 # Number of elements to be generated
set.seed(13) # Set RNG
x = rnorm(n) # Generate numbers from a normal distribution
reedddd = scales::alpha("blue", .4) # Make colour transparent
# Define the function
sample.mean.pop.est <- function(x, n, sample.size, ylim = NULL) {
  # x: the actual values of the trait measured
  # n: size of the population (number of individuals or items)
  # sample.size: how big is the sample from which the MEAN will be calculated
  # ylim: add a maximum if needed
  # Histogram of the population: just get the stats from the histogram
  pop.hist = hist(x, plot = F) # Make base histogram
  # Make empty vector
  tmp.v = c(NA)
  # For loop to calculate the mean based on a sample from the population
  for (i in 1:n) {
    tmp = sample(x = x, size = sample.size, replace = FALSE)
    # Record that information (mean of the sample)
    tmp.v[i] = mean(tmp)
  } # End i
  # Sample histogram
  sample.hist = hist(tmp.v, plot = F)
  # Population histogram
  hist(x, ylim = range(c(0, c(sample.hist$counts, pop.hist$counts), ylim)),
       main = paste("Sample n =", sample.size))
  # Add the sample estimate
  sample.hist = hist(tmp.v, col = reedddd, add = T)
} # End sample.mean.pop.est
par(mfrow = c(2, 2), lwd = .3)
sample.mean.pop.est(x = x, n = n, sample.size = 1, ylim = 300)
sample.mean.pop.est(x = x, n = n, sample.size = 10, ylim = 300)
sample.mean.pop.est(x = x, n = n, sample.size = 50, ylim = 300)
sample.mean.pop.est(x = x, n = n, sample.size = 500, ylim = 300)
#   0 0.00 TJWPPBT IYDHKFIEOSQQEZJCSMCQ
#  20 0.43 MJTPVNT IID KS EOSE E WESMCQ
#  40 0.64 MJTPVNQ IT IS LOLE A WESSCQ
#  60 0.79 MJTPVNKS IT IS LIKE A WESSCQ
#  80 0.93 MJTPINKS IT IS LIKE A WEASEL
# 100 0.93 MJTPINKS IT IS LIKE A WEASEL
# 119 1.00 METHINKS IT IS LIKE A WEASEL
# 119 0.04 NSPZXZKXBVKFGJQHGFALTHZLCINM
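The trace above comes from a cumulative-selection ("weasel") program, while the last line shows a purely random attempt for comparison. A minimal sketch of such a program (the mutation rate and offspring number are illustrative, not the workshop's exact code):

target <- unlist(strsplit("METHINKS IT IS LIKE A WEASEL", ""))
alphabet <- c(LETTERS, " ")
set.seed(1)
parent <- sample(alphabet, length(target), replace = TRUE)
gen <- 0
while (!all(parent == target)) {
  offspring <- replicate(100, {
    mut <- runif(length(parent)) < 0.05 # each character mutates with prob. 0.05
    child <- parent
    child[mut] <- sample(alphabet, sum(mut), replace = TRUE)
    child
  }, simplify = FALSE)
  offspring <- c(offspring, list(parent)) # keep the parent so fitness never decreases
  scores <- sapply(offspring, function(ch) sum(ch == target)) # fitness = matching letters
  parent <- offspring[[which.max(scores)]] # cumulative selection: keep the best
  gen <- gen + 1
}
cat("Reached the target in", gen, "generations\n")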
Other packages for simulating data:
simecol (simulation of ecological systems)
cds, with the function simpca to simulate PCA data
mobsim, for the measurement of biodiversity across space (archived)
rspecies, with the function rspecies (archived)

library(cds)
simpca(nr.indv = rep(200, 5), m = 10, q = 7, R = rcormat(m = 10),
       err.coeff = 0.1,
       alphamat = rbind(c(0.5, 2, 4), c(10, 2, 10), c(1, 2, 1),
                        c(4, 2, 0.5), c(0.1, 2, 0.1))[1:length(rep(200, 5)), ],
       randomize = FALSE)
Thanks to all the people who contributed to this presentation. All of the inspiration for this workshop is referenced in the markdown or linked in the notes.
In particular, thanks to the Barrett lab graduate students (McGill) for their valuable insights and support, Winer Daniel Reyes Corral (McGill), Pierre-Olivier Montiglio (UQAM), and others.
Thank you for your participation and helping me improve the workshop.