A course with R, Stan, and brms
This course is largely inspired by the following books:
McElreath, R. (2016, 2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Kurz, S. (2019). Statistical Rethinking with brms, ggplot2, and the tidyverse. Available online.
Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, third edition. London: CRC Press.
Lambert, B. (2018). A Student’s Guide to Bayesian Statistics. SAGE Publications Ltd.
Noël, Y. (2015). Psychologie Statistique. EDP Sciences.
Nicenboim, B., Schad, D., & Vasishth, S. (2021). An Introduction to Bayesian Data Analysis for Cognitive Science. Available online.
Code and slides are available at the course’s website: https://lnalborczyk.github.io/IBSM2023/.
General objectives:
Practical objectives: being able to conduct the complete Bayesian analysis (using R, interpreting and reporting the results) of a simple dataset.
Course n°01: Introduction to Bayesian inference, Beta-Binomial model
Course n°02: Introduction to brms, linear regression
Course n°03: Markov Chain Monte Carlo, generalised linear model
Course n°04: Multilevel models, cognitive models
\[\newcommand\given[1][]{\:#1\vert\:}\]
What is the probability…
Of obtaining an odd number with a fair die?
Of you learning something new during this course?
Do these two questions refer to the same “sort” of probability?
\[ \Pr(\text{odd}) = \frac{\text{number of favourable outcomes}}{\text{total number of possible outcomes}} = \frac{3}{6} = \frac{1}{2} \]
Problem: this definition only applies to situations in which there is a finite number of equiprobable potential outcomes…
Limitation: what is the probability that it will rain tomorrow?
\[\Pr(\text{rain}) = \frac{\text{rain}}{ \{ \text{rain, not-rain} \} } = \frac{1}{2}\]
\[\Pr(x) = \lim_{n_{t} \to \infty}\frac{n_{x}}{n_{t}}\]
Where \(n_{x}\) is the number of occurrences of the event \(x\) and \(n_{t}\) the total number of observations. The frequentist interpretation postulates that, in the long run (i.e., as the number of observations approaches infinity), the relative frequency of an event converges to what we call its “probability”.
Consequence: the concept of probability only applies to collectives, not to single events.
library(tidyverse)
set.seed(666) # for reproducibility

# simulating 500 tosses of a fair coin (0 = tails, 1 = heads)
# and plotting the running proportion of heads
sample(x = c(0, 1), size = 500, prob = c(0.5, 0.5), replace = TRUE) %>%
  data.frame(flip = .) %>%
  mutate(x = seq_along(flip), y = cummean(flip) ) %>%
  ggplot(aes(x = x, y = y) ) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0.5, lty = 3) +
  labs(x = "Trial number", y = "Proportion of heads") +
  ylim(0, 1)
Which reference class? What is the probability that I will live until 80 years old? As a man? As a French person?
What about non-repeatable events? What is the probability of you learning something new during this course?
The resolution issue: How many observations do we need to get a good approximation of the underlying probability? A finite class of events of size \(n\) can only produce relative frequencies with precision \(1/n\)…
The frequentist (i.e., long-run) properties of objects (e.g., a coin) are caused by the intrinsic physical properties of those objects. For example, a biased coin will generate a biased relative frequency (and therefore probability) because of its physical properties. For propensity theorists, probabilities represent these intrinsic characteristics, these propensities to generate certain relative frequencies, and not the relative frequencies themselves.
Consequence: these properties are properties of individual events… and not of sequences! The propensity interpretation therefore allows us to talk about the probability of single events.
The logical interpretation of the concept of probability attempts to generalise logic (true/false) to the probabilistic world. Probability therefore represents the degree of logical support that a conclusion has, relative to a set of premises (Carnap, 1971; Keynes, 1921).
Consequence: all probability is conditional.
Probability is a measure of the degree of uncertainty. An event that is certain will therefore have a probability of 1 and an event that is impossible will have a probability of 0.
So to assign equal probabilities to two events is not in any way an assertion that they must occur equally often in any random experiment […], it is only a formal way of saying I don’t know (Jaynes, 1986).
To talk about probabilities in this context, we no longer need to refer to the limit of occurrence of an event (frequency).
For more details, see this article from the Stanford Encyclopedia of Philosophy.
Epistemic probability
All probabilities are conditional on available information (e.g., premises or data). Probability is used as a means of quantifying uncertainty.
Logical interpretation, Bayesian interpretation.
Physical probability
Probabilities depend on a state of the world, on physical characteristics, and are independent of available information (or uncertainty).
Classical interpretation, frequentist interpretation.
A probability is a numerical value assigned to an event \(A\), understood as a possibility belonging to the set of all possible outcomes \(\Omega\).
Probabilities have to conform to the following (Kolmogorov) axioms:
- Non-negativity: \(\Pr(A) \geq 0\) for every event \(A\).
- Normalisation: \(\Pr(\Omega) = 1\).
- Additivity: if \(A_{1}\) and \(A_{2}\) are mutually exclusive, then \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2})\).
The last axiom is also known as the sum rule and can be generalised to non-mutually exclusive events: \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2}) - \Pr(A_{1} \cap A_{2})\).
Sum rule (for two mutually exclusive events): \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2})\).
Think about the probability of getting an odd number with a fair die. We may also write it as \(\Pr(x = 1) + \Pr(x = 3) + \Pr(x = 5) = \frac{3}{6}\).
Product rule (for two independent events): \(\Pr(A_{1} \cap A_{2}) = \Pr(A_{1}) \times \Pr(A_{2})\).
Think about the probability of getting two 6s in a row with a fair die. We may also write it as \(\Pr(x = 6, y = 6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}\).
If you understand and remember these two rules, you already know Bayesian statistics!
Probability that die \(x\) is equal to 2 and die \(y\) is equal to 3 is: \(\Pr(x = 2, y = 3) = \Pr(y = 3, x = 2) = \frac{1}{36}\).
With more than one variable, the sum rule tells us how to ignore one of them. For instance, the probability that the first die shows 1 is: \(\Pr(x = 1) = \Pr(x = 1, y \in \{1, 2, 3, 4, 5, 6\}) = \frac{6}{36}\). This is called a marginal probability because these summed probabilities can be written in the margins of a joint probability table.
Probability that two dice total 4 is: \(\Pr(x + y = 4) = \frac{3}{36}\).
What is the probability that die \(x\) equals some value given that the total is 4? For instance, the probability of die \(x\) being equal to 2: \(\Pr(x = 2 \given x + y = 4) = \frac{1}{3}\).
This conditional probability can be rewritten: \(\Pr(x = 2 \given x + y = 4) = \frac{\Pr(x = 2, x + y = 4) }{\Pr(x + y = 4) } = \frac{1/36}{3/36} = \frac{1}{3}\). Note that \(\Pr(x \given y)\) is not necessarily equal (and is generally not equal) to \(\Pr(y \given x)\).
For instance: the probability of dying knowing that you have been attacked by a shark is not the same as the probability of having been attacked by a shark knowing that you are dead (confusion of the inverse). In the same way, \(p(\text{data} \given \mathcal{H}_{0}) \neq p(\mathcal{H}_{0} \given \text{data})\).
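As a quick sanity check, a minimal R sketch (ours, under the assumption of two fair dice) verifying that \(\Pr(x = 2 \given x + y = 4) = \frac{1}{3}\) by simulation:
set.seed(666) # for reproducibility
x <- sample(x = 1:6, size = 1e5, replace = TRUE) # first die
y <- sample(x = 1:6, size = 1e5, replace = TRUE) # second die
mean(x[x + y == 4] == 2) # should be close to 1/3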
From Kolmogorov’s axioms and the previous definitions of joint, marginal, and conditional probabilities, we derive the general version (i.e., not necessarily for independent events) of the product rule:
\[p(x,y) = p(x \given y) \ p(y) = p(y \given x) \ p(x)\]
\[p(y \given x) \ p(x) = p(x \given y) \ p(y)\]
Then divide each side by \(p(x)\):
\[p(y \given x) = \dfrac{p(x \given y) \ p(y)}{p(x)}\]
\[p(x \given y) = \dfrac{p(y \given x) \ p(x)}{p(y)}\]
If we replace \(x\) by \(\text{hypothesis}\) and \(y\) by \(\text{data}\):
\[ \Pr(\text{hypothesis} \given \text{data}) = \frac{\Pr(\text{data} \given \text{hypothesis}) \times \Pr(\text{hypothesis})}{\text{sum of products}} \]
Let’s imagine we have a bag containing 4 marbles. These marbles can be either white or blue. We know that there are precisely 4 marbles, but we don’t know the number of marbles of each colour.
However, we do know that there are five possibilities (which we consider to be our hypotheses):
⚪ ⚪ ⚪ ⚪
🔵 ⚪ ⚪ ⚪
🔵 🔵 ⚪ ⚪
🔵 🔵 🔵 ⚪
🔵 🔵 🔵 🔵
The aim is to determine which combination is the most likely, given certain observations. Let’s assume that we draw three marbles in a row, with replacement, and obtain the following sequence: 🔵 ⚪ 🔵.
This sequence represents our data. What inference can we make about the contents of the bag? In other words, what can we say about the probability of each hypothesis?
Under the hypothesis 🔵 ⚪ ⚪ ⚪, \(3\) paths (out of \(4^{3} = 64\)) lead to the observed data. What about the other hypotheses?
Given the data, the most probable hypothesis is the one that maximises the number of possible ways of obtaining the observed data.
| Hypothesis | Ways to obtain the data |
|---|---|
| ⚪ ⚪ ⚪ ⚪ | 0 × 4 × 0 = 0 |
| 🔵 ⚪ ⚪ ⚪ | 1 × 3 × 1 = 3 |
| 🔵 🔵 ⚪ ⚪ | 2 × 2 × 2 = 8 |
| 🔵 🔵 🔵 ⚪ | 3 × 1 × 3 = 9 |
| 🔵 🔵 🔵 🔵 | 4 × 0 × 4 = 0 |
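To make the counting explicit, we can reproduce this table with a few lines of R (a minimal sketch):
blue <- 0:4 # number of blue marbles under each hypothesis
ways <- blue * (4 - blue) * blue # paths producing the sequence blue, white, blue
ways # 0 3 8 9 0
ways / sum(ways) # normalised counts: 0.00 0.15 0.40 0.45 0.00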
In the previous case, we considered that all the hypotheses were equally probable a priori (according to the principle of indifference). However, we could have a priori information from our knowledge (of the characteristics of the bags of marbles, for example) or from previous data.
Let’s assume we draw a new marble from the bag. How do we incorporate this new observation?
All we have to do is apply the same strategy as before and update the previous counts by multiplying them by the number of ways each hypothesis can produce the new observation. Yesterday’s posterior is today’s prior (Lindley, 2000).
| Hypothesis | Ways to produce 🔵 | Previous count | New count |
|---|---|---|---|
| ⚪ ⚪ ⚪ ⚪ | 0 | 0 | 0 × 0 = 0 |
| 🔵 ⚪ ⚪ ⚪ | 1 | 3 | 3 × 1 = 3 |
| 🔵 🔵 ⚪ ⚪ | 2 | 8 | 8 × 2 = 16 |
| 🔵 🔵 🔵 ⚪ | 3 | 9 | 9 × 3 = 27 |
| 🔵 🔵 🔵 🔵 | 4 | 0 | 0 × 4 = 0 |
Now let’s suppose that an employee at the marble factory tells us that blue marbles are rare… For every bag containing 3 blue marbles, the factory produces two bags containing only 2 blue marbles, and three bags containing only 1. The employee also tells us that every bag contains at least one blue marble and one white marble…
| Hypothesis | Previous count | Factory prior | New count |
|---|---|---|---|
| ⚪ ⚪ ⚪ ⚪ | 0 | 0 | 0 × 0 = 0 |
| 🔵 ⚪ ⚪ ⚪ | 3 | 3 | 3 × 3 = 9 |
| 🔵 🔵 ⚪ ⚪ | 16 | 2 | 16 × 2 = 32 |
| 🔵 🔵 🔵 ⚪ | 27 | 1 | 27 × 1 = 27 |
| 🔵 🔵 🔵 🔵 | 0 | 0 | 0 × 0 = 0 |
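The same updating logic can be written directly in R (a minimal sketch, reusing the counts from the tables above):
ways <- c(0, 3, 8, 9, 0) # ways to obtain the initial data (first table)
blue <- 0:4 # ways to produce a new blue marble under each hypothesis
updated <- ways * blue # after the new draw: 0 3 16 27 0
factory <- c(0, 3, 2, 1, 0) # prior counts given by the factory employee
updated * factory # final counts: 0 9 32 27 0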
The probability of a hypothesis after observing certain data is proportional to the number of ways in which this hypothesis can produce the observed data, multiplied by its a priori probability.
\[ \Pr(\text{hypothesis} \given \text{data}) \propto \Pr(\text{data} \given \text{hypothesis}) \times \Pr(\text{hypothesis}) \]
To convert plausibilities to probabilities, all we have to do is standardise these plausibilities so that the sum of the plausibilities of all possible hypotheses is equal to \(1\).
\[ \Pr(\text{hypothesis} \given \text{data}) = \frac{\Pr(\text{data} \given \text{hypothesis})\times \Pr(\text{hypothesis})}{\text{sum of products}} \]
We define \(p\) as the proportion of blue marbles in the bag.
| Hypothesis | \(p\) | Ways to produce the data | Probability |
|---|---|---|---|
| ⚪ ⚪ ⚪ ⚪ | 0 | 0 | 0 |
| 🔵 ⚪ ⚪ ⚪ | 0.25 | 3 | 0.15 |
| 🔵 🔵 ⚪ ⚪ | 0.5 | 8 | 0.40 |
| 🔵 🔵 🔵 ⚪ | 0.75 | 9 | 0.45 |
| 🔵 🔵 🔵 🔵 | 1 | 0 | 0 |
\[ \color{purple}{p(\theta \given x)} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{\sum\limits_{\theta}p(x \given \theta)p(\theta)}} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{\int\limits_{\theta}p(x \given \theta)p(\theta)\,\mathrm{d}\theta}} \propto \color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)} \]
In this framework, for each problem, we will follow 3 steps: specifying a prior \(p(\theta)\), specifying a likelihood function \(p(x \given \theta)\), and computing (or approximating) the posterior distribution \(p(\theta \given x)\).
Bayesian inference is really just counting and comparing of possibilities […] in order to make good inference about what actually happened, it helps to consider everything that could have happened (McElreath, 2016).
The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models (Kruschke, 2015).
In women aged 40-50, with no family history and no symptoms, the probability of developing breast cancer is around \(0.008\).
Known properties of mammography: if cancer is present, the probability of a positive mammogram (the sensitivity) is approximately \(0.9\); if cancer is absent, the probability of a (false) positive mammogram is approximately \(0.07\).
Suppose a woman has a mammogram and the test is positive. What should be inferred? What is the probability that this woman has breast cancer?
Reasoning from the likelihood alone leads to the conclusion that cancer is present (because the hypothesis “cancer” maximises the probability of a positive mammogram)…
\[\Pr(\text{Cancer=+} \given \text{Mam=+}) = \frac{7}{7 + 69} = \frac{7}{76} \approx 0.09\]
\[ \color{purple}{p(\theta \given x)} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}} \]
\(\color{steelblue}{p(\theta)}\) represents the prior probability of \(\theta\): what we know about \(\theta\) before observing the data. In this case: \(\Pr(\text{Cancer=+}) = 0.008\) and \(\Pr(\text{Cancer=-}) = 0.992\).
\[ \color{purple}{p(\theta \given x)} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}} \]
\(\color{orangered}{p(x \given \theta)}\) represents the conditional probability of the data \(x\) knowing the parameter \(\theta\), also known as the likelihood function of the parameter \(\theta\).
\[\color{purple}{p(\theta \given x)} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}}\]
\(p(x)\) represents the marginal probability of \(x\) (averaged over \(\theta\)). This is a constant used to normalise the distribution (the “sum of products” from the previous example).
\[\color{green}{p(x) = \sum\limits_{\theta}p(x \given \theta)p(\theta)}\]
\[\color{purple}{p(\theta \given x)} = \dfrac{\color{orangered}{p(x \given \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}}\]
\(\color{purple}{p(\theta \given x)}\) represents the posterior probability of \(\theta\) given \(x\), that is, what we know about \(\theta\) after observing \(x\).
Before the mammogram, the probability of a woman drawn at random having breast cancer was \(\Pr(\text{Cancer=+}) = 0.008\) (prior). After a positive result, this probability becomes \(\Pr(\text{Cancer=+} \given \text{Mam=+}) \approx 0.09\) (posterior). These probabilities express our state of knowledge: after a positive mammogram, we still consider cancer “very unlikely”, but this probability has changed considerably compared with before the test.
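Numerically, a minimal R sketch of this computation (assuming, consistent with the counts above, a sensitivity of about 0.9 and a false-positive rate of about 0.07):
prior <- c(cancer = 0.008, no_cancer = 0.992) # Pr(Cancer)
likelihood <- c(cancer = 0.9, no_cancer = 0.07) # Pr(Mam=+ | Cancer), assumed values
evidence <- sum(likelihood * prior) # Pr(Mam=+), the marginal probability
likelihood * prior / evidence # posterior: Pr(Cancer=+ | Mam=+) is approximately 0.09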
A Bayesianly justifiable analysis is one that treats known values as observed values of random variables, treats unknown values as unobserved random variables, and calculates the conditional distribution of unknowns given knowns and model specifications using Bayes’ theorem (Rubin, 1984).
The Bernoulli distribution applies to all situations where the data-generating process can only result in two mutually exclusive outcomes (e.g., a coin toss). On each trial, if we assume that \(\Pr(\text{heads}) = \theta\), then \(\Pr(\text{tails}) = 1 - \theta\).
Since Bernoulli, we know how to compute the probability of the result of a coin toss, as long as we know the coin’s bias \(\theta\). Let’s assume that \(Y = 0\) when you get tails, and that \(Y = 1\) when you get heads. Then \(Y\) is distributed according to a Bernoulli distribution:
\[p(y \given \theta) = \Pr(Y = y \given \theta) = \theta^{y} (1 - \theta)^{(1 - y)}\]
If we replace \(y\) by \(0\) or \(1\), we come back to our previous observations:
\[\Pr(Y = 1 \given \theta) = \theta^{1} (1 - \theta)^{(1 - 1)} = \theta \times 1 = \theta\]
\[\Pr(Y = 0 \given \theta) = \theta^{0} (1 - \theta)^{(1 - 0)} = 1 \times (1 - \theta) = 1 - \theta\]
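In R, the Bernoulli distribution is the special case of the binomial with size = 1. For instance, with an assumed bias of \(\theta = 0.6\):
theta <- 0.6 # assumed bias of the coin
dbinom(x = c(0, 1), size = 1, prob = theta) # Pr(Y = 0) = 0.4, Pr(Y = 1) = 0.6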
If we have a series of independent and identically distributed throws \(\{Y_i\}\) (i.e., each throw follows a Bernoulli distribution with probability \(\theta\)), the total number of successes across throws follows a Binomial distribution.
For example, suppose we have the following sequence of five throws: Tails, Tails, Tails, Heads, Heads. We can recode this sequence into \(\{0, 0, 0, 1, 1\}\).
Reminder: The probability of each \(1\) is \(\theta\) and the probability of each \(0\) is \(1 - \theta\).
What is the probability of getting 2 heads out of 5 throws?
Knowing that the trials are independent of each other, the probability of obtaining this sequence is \((1 - \theta) \times (1 - \theta) \times (1 - \theta) \times \theta \times \theta\), that is: \(\theta^{2} (1 - \theta)^{3}\).
We can generalise this result for a sequence of \(n\) throws and \(y\) “successes”:
\[\theta^{y} (1 - \theta)^{n - y}\]
So far, we have only considered a single sequence resulting in 2 successes for 5 throws, but there are many sequences that can result in 2 successes for 5 throws (e.g., \(\{0, 0, 1, 0, 1\}\), \(\{0, 1, 1, 0, 0\}\))…
The binomial coefficient gives the number of possible combinations resulting in \(y\) successes in \(n\) throws (read “\(y\) among \(n\)” or “number of combinations of \(y\) among \(n\)”):
\[ \left(\begin{array}{l} n \\ y \end{array}\right) = C_{y}^{n} = \frac{n !}{y !(n - y) !} \]
For instance for \(y = 1\) and \(n = 3\), we know there are 3 possible combinations: \(\{0, 0, 1\}, \{0, 1, 0\}, \{1, 0, 0\}\). We can check this by applying the formula above.
\[ \left(\begin{array}{l} 3 \\ 1\end{array}\right) = C_{1}^{3} = \frac{3 !}{1 !(3 - 1) !} = \frac{3 \times 2 \times 1}{1 \times 2 \times 1} = \frac{6}{2} = 3 \]
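We can check this computation in R with the built-in choose() function:
choose(n = 3, k = 1) # number of combinations of 1 among 3: 3
factorial(3) / (factorial(1) * factorial(3 - 1) ) # same result, by the formula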
\[
p(y \given \theta) = \Pr(Y = y \given \theta) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y}
\]
The binomial distribution allows us to calculate the probability of obtaining \(y\) successes in \(n\) trials, for a given \(\theta\). Example of the binomial distribution for an unbiased coin (\(\theta = 0.5\)), giving the probability of obtaining \(0\) to \(10\) heads in 10 throws (in R: dbinom(x = 0:10, size = 10, prob = 0.5)).
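A minimal sketch reproducing that figure (assuming tidyverse is loaded, as in the other blocks of these slides):
data.frame(y = 0:10, probability = dbinom(x = 0:10, size = 10, prob = 0.5) ) %>%
  ggplot(aes(x = y, y = probability) ) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  labs(x = "Number of heads", y = "Probability")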
library(tidyverse)
set.seed(666) # for reproducibility

# simulating 500 tosses of a coin with bias theta = 0.6
# and plotting the running proportion of heads
sample(x = c(0, 1), size = 500, prob = c(0.4, 0.6), replace = TRUE) %>%
  data.frame(flip = .) %>%
  mutate(x = seq_along(flip), y = cummean(flip) ) %>%
  ggplot(aes(x = x, y = y) ) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0.6, lty = 3) +
  labs(x = "Number of trials", y = "Proportion of heads") +
  ylim(0, 1)
Likelihood function:
The likelihood function is written as:
\[ \color{orangered}{\mathcal{L}(\theta \given y, n) = p(y \given \theta, n) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y} \propto \theta^{y}(1 - \theta)^{n - y}} \]
Consider again a coin with bias \(\theta\) (where \(\theta\) represents the probability of getting heads). The coin is tossed twice and we obtain one Heads and one Tails.
We can compute the probability of getting one Heads in two coin tosses as a function of \(\theta\) as follows:
\[ \begin{aligned} \Pr(H, T \given \theta) + \Pr(T, H \given \theta) &= \Pr(H \given \theta) \Pr(T \given \theta) + \Pr(T \given \theta) \Pr(H \given \theta) \\ &= \theta(1 - \theta) + (1 - \theta)\theta \\ &= 2 \theta(1 - \theta) \end{aligned} \]
This probability is defined for a fixed data set and a varying \(\theta\). This function can be represented visually.
# Graphical representation of the likelihood function for y = 1 and n = 2
y <- 1 # number of heads
n <- 2 # number of trials
data.frame(theta = seq(from = 0, to = 1, length.out = 1e3) ) %>%
mutate(likelihood = dbinom(x = y, size = n, prob = theta) ) %>%
ggplot(aes(x = theta, y = likelihood) ) +
geom_area(color = "orangered", fill = "orangered", alpha = 0.5) +
labs(x = expression(paste(theta, " - Pr(head)") ), y = "Likelihood")
If we calculate the area under the curve of this function, we get:
\[\int_{0}^{1} 2 \theta(1 - \theta) \mathrm{d} \theta = \frac{1}{3}\]
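We can verify this integral numerically with R’s integrate():
integrate(f = function(theta) 2 * theta * (1 - theta), lower = 0, upper = 1)
# returns approximately 0.333, i.e., 1/3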
When we vary \(\theta\), the likelihood function is not a valid probability distribution (i.e., its integral is not equal to 1). We use the term likelihood to distinguish this type of function from probability density functions. We use the following notation to emphasise the fact that the likelihood function is a function of \(\theta\), and that the data are fixed: \(\mathcal{L}(\theta \given \text{data}) = p(\text{data} \given \theta)\).
Probability of observing \(y\) heads in \(n = 2\) throws, for different values of \(\theta\):

| \(\theta\) | \(y = 0\) | \(y = 1\) | \(y = 2\) | Total |
|---|---|---|---|---|
| 0 | 1.00 | 0.00 | 0.00 | 1 |
| 0.2 | 0.64 | 0.32 | 0.04 | 1 |
| 0.4 | 0.36 | 0.48 | 0.16 | 1 |
| 0.6 | 0.16 | 0.48 | 0.36 | 1 |
| 0.8 | 0.04 | 0.32 | 0.64 | 1 |
| 1 | 0.00 | 0.00 | 1.00 | 1 |
| Total | 2.20 | 1.60 | 2.20 | |
Note that the likelihood of \(\theta\) for a particular data item is equal to the probability of the data for this value of \(\theta\). However, the distribution of these likelihoods (in columns) is not a probability distribution. In a usual Bayesian analysis, the data are considered fixed and the value of \(\theta\) is considered a random variable.
How can we define a prior for \(\theta\)?
Semantic aspect \(~\rightarrow~\) the prior should be able to represent what we know (and how certain we are) about \(\theta\), a proportion lying in the interval \([0, 1]\).
Mathematical aspect \(~\rightarrow~\) for a fully analytical solution, the prior should be conjugate to the Binomial likelihood (as with the Beta distribution introduced below).
\[ \begin{align} \color{steelblue}{p(\theta \given a, b)} \ &\color{steelblue}{= \mathrm{Beta}(\theta \given a, b)} \\ & \color{steelblue}{= \theta^{a - 1}(1 - \theta)^{b - 1} / B(a, b)} \\ & \color{steelblue}{\propto \theta^{a - 1}(1 - \theta)^{b - 1}} \end{align} \]
where \(a > 0\) and \(b > 0\) are two shape parameters and \(B(a, b)\) is a normalisation constant (the Beta function).
The level of certainty increases with the sum \(\kappa = a + b\).
Suppose we have an estimate of the most likely value of the parameter \(\theta\). We can re-parametrise the Beta distribution as a function of the mode \(\omega\) and the level of certainty \(\kappa\):
\[ \begin{align} a &= \omega(\kappa - 2) + 1 \\ b &= (1 - \omega)(\kappa - 2) + 1 &&\mbox{for } \kappa > 2 \end{align} \]
If \(\omega = 0.65\) and \(\kappa = 25\), then \(p(\theta) = \mathrm{Beta}(\theta \given 15.95, 9.05)\).
If \(\omega = 0.65\) and \(\kappa = 10\) then \(p(\theta) = \mathrm{Beta}(\theta \given 6.2, 3.8)\).
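These values can be checked with a few lines of R:
omega <- 0.65 # mode of the distribution
kappa <- 25 # level of certainty
(a <- omega * (kappa - 2) + 1) # 15.95
(b <- (1 - omega) * (kappa - 2) + 1) # 9.05
(a - 1) / (a + b - 2) # mode of Beta(a, b): back to 0.65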
Formally, if \(\mathcal{F}\) is a class of sampling distributions \(p(y \given \theta)\), and \(\mathcal{P}\) is a class of prior distributions for \(\theta\), then the class \(\mathcal{P}\) is conjugate to \(\mathcal{F}\) if:
\[ p(\theta \given y) \in \mathcal{P} \text{ for all } p(\cdot \given \theta) \in \mathcal{F} \text{ and } p(\cdot) \in \mathcal{P} \]
(Gelman et al., 2013, p. 35). In other words, a prior is called a conjugate prior if, when multiplied by the likelihood to form a posterior, it keeps the same functional form. In our case, the Beta prior is conjugate to the Binomial likelihood, because the posterior is also a Beta distribution.
Assume a prior defined by: \(\ \color{steelblue}{p(\theta \given a, b) = \mathrm{Beta}(\theta \given a, b) = \frac{\theta^{a - 1}(1 - \theta)^{b - 1}}{B(a, b)} \propto \theta^{a - 1}(1 - \theta)^{b - 1}}\)
Given a likelihood function associated with \(y\) “Heads” for \(n\) throws:
\(\ \color{orangered}{p(y \given n, \theta) = \mathrm{Bin}(y \given n, \theta) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y} \propto \theta^{y}(1 - \theta)^{n - y}}\)
Then (omitting the normalisation constants):
\[ \begin{align} \color{purple}{p(\theta \given y, n)} &\propto \color{orangered}{p(y \given n, \theta)} \ \color{steelblue}{p(\theta)} &&\mbox{Bayes theorem} \\ &\propto \color{orangered}{\mathrm{Bin}(y \given n, \theta)} \ \color{steelblue}{\mathrm{Beta}(\theta \given a, b)} \\ &\propto \color{orangered}{\theta^{y}(1 - \theta)^{n - y}} \ \color{steelblue}{\theta^{a - 1}(1 - \theta)^{b - 1}} &&\mbox{Application of previous formulas} \\ &\propto \color{purple}{\theta}^{\color{orangered}{y} + \color{steelblue}{a - 1}}\color{purple}{(1 - \theta)}^{\color{orangered}{n - y} + \color{steelblue}{b - 1}} &&\mbox{Grouping powers of identical terms} \\ \end{align} \]
Here, we have ignored the constants that do not depend on \(\theta\) (i.e., the number of combinations in the binomial likelihood function and the Beta function \(B(a, b)\) in the Beta prior). Taking them into account, we obtain a Beta posterior distribution of the following form:
\[\color{purple}{p(\theta \given y, n) = \mathrm{Beta}(y + a, n - y + b)}\]
We observe \(y = 7\) correct answers out of \(n = 10\) questions. We choose a prior \(\mathrm{Beta}(1, 1)\), that is, a uniform prior on \([0, 1]\). This prior is equivalent to a prior knowledge of 0 successes and 0 failures (i.e., a flat prior).
The posterior distribution is given by:
\[ \begin{align} \color{purple}{p(\theta \given y, n)} &\propto \color{orangered}{p(y \given n, \theta)} \ \color{steelblue}{p(\theta)} \\ &\propto \color{orangered}{\mathrm{Bin}(7 \given 10, \theta)} \ \color{steelblue}{\mathrm{Beta}(\theta \given 1, 1)} \\ &= \color{purple}{\mathrm{Beta}(y + a, n - y + b)} \\ &= \color{purple}{\mathrm{Beta}(8, 4)} \end{align} \]
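Since the posterior is available in closed form, we can plot it directly with dbeta() (a minimal sketch, assuming tidyverse is loaded):
y <- 7; n <- 10; a <- 1; b <- 1 # data and prior
data.frame(theta = seq(from = 0, to = 1, length.out = 1e3) ) %>%
  mutate(posterior = dbeta(x = theta, shape1 = y + a, shape2 = n - y + b) ) %>%
  ggplot(aes(x = theta, y = posterior) ) +
  geom_area(color = "purple", fill = "purple", alpha = 0.5) +
  labs(x = expression(theta), y = "Probability density")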
The mean of the posterior distribution is given by:
\[ \color{purple}{\underbrace{\frac{y + a}{n + a + b}}_{posterior}} = \color{orangered}{\underbrace{\frac{y}{n}}_{data}} \underbrace{\frac{n}{n + a + b}}_{weight} + \color{steelblue}{\underbrace{\frac{a}{a + b}}_{prior}} \underbrace{\frac{a + b}{n + a + b}}_{weight} \]
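With the numbers from the previous example (\(y = 7\), \(n = 10\), \(a = b = 1\)), we can check this decomposition numerically:
y <- 7; n <- 10; a <- 1; b <- 1
(y + a) / (n + a + b) # posterior mean: 0.667
(y / n) * (n / (n + a + b) ) + (a / (a + b) ) * ( (a + b) / (n + a + b) ) # same value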
Case where \(n < a + b, (n = 10, a = 4, b = 16)\).
Case where \(n = a + b, (n = 20, a = 4, b = 16)\).
Case where \(n > a + b, (n = 40, a = 4, b = 16)\).
The posterior distribution is always a compromise between the prior distribution and the likelihood function (Kruschke, 2015).
The more data we have, the less influence the prior has in estimating the posterior distribution (and vice versa). Warning: When the prior assigns a probability of 0 to certain values of \(\theta\), the model is unable to learn (these values are then considered “impossible”).
\[ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal likelihood}} \propto \text{Likelihood} \times \text{Prior} \]
\[ p(\theta \given \text{data}) = \frac{p(\text{data} \given \theta) \times \ p(\theta)}{p(\text{data})} \propto p(\text{data} \given \theta) \times p(\theta) \]
If we zoom in on the marginal likelihood (also known as evidence)…
\[ \begin{align} \color{green}{p(\text{data})} &= \int p(\text{data}, \theta) \, \mathrm d\theta &&\mbox{Marginalising over } \theta \\ \color{green}{p(\text{data})} &= \int p(\text{data} \given \theta) \times p(\theta) \, \mathrm{d} \theta && \mbox{Applying the product rule} \end{align} \]
New issue: \(p(\text{data})\) is obtained by computing the sum (for discrete variables) or the integral (for continuous variables) of the joint density \(p(\text{data}, \theta)\) over all possible values of \(\theta\). This can become intractable when the model includes many continuous parameters…
Let’s consider a model with two discrete parameters. The marginal likelihood is obtained as:
\[ p(\text{data}) = \sum_{\theta_{1}} \sum_{\theta_{2}} p(\text{data}, \theta_{1}, \theta_{2}) \]
Let’s now consider a model with two continuous parameters. The marginal likelihood is obtained as:
\[ p(\text{data}) = \int\limits_{\theta_{1}} \int\limits_{\theta_{2}} p(\text{data}, \theta_{1}, \theta_{2}) \mathrm{d} \theta_{1} \mathrm{d} \theta_{2} \]
There are three ways to get around this problem:
Analytical solution \(\longrightarrow\) Using a conjugate prior (e.g., the Beta-Binomial model).
Discretised solution \(\longrightarrow\) Computing the posterior on a finite set of points (grid method).
Approximated solution \(\longrightarrow\) The parameter space is “cleverly” sampled (e.g., MCMC methods, cf. Course n°03).
Problem: this solution is very restrictive. Ideally, the model (likelihood + prior) should be defined on the basis of the interpretation of the parameters of these distributions, not to facilitate the calculations…
Problem with the number of parameters: refining the grid increases the calculation time. With \(k\) parameters and \(m\) grid points per parameter, the grid contains \(m^{k}\) nodes…
The best supercomputer (Frontier) performs around \(1.194 \times 10^{18}\) operations per second. If we consider that it would have to perform 4 operations per node of the grid, it would take more time to go through the grid than the estimated age of the universe (approximately \((4.36 ± 0.012) \times 10^{17}\) seconds)…
To sample (cleverly) a posterior distribution, we can use different implementations of MCMC methods (e.g., Metropolis-Hastings, Hamiltonian Monte Carlo) which we will discuss in Course n°03. In the meantime, we will work with samples from the posterior distribution i) to get used to results from MCMC methods and ii) because it is simpler to compute summary statistics (e.g., mean or credible intervals) on samples rather than by computing integrals.
p_grid <- seq(from = 0, to = 1, length.out = 1000) # creates a grid
prior <- rep(1, 1000) # uniform prior
likelihood <- dbinom(x = 12, size = 20, prob = p_grid) # computes likelihood
posterior <- (likelihood * prior) / sum(likelihood * prior) # computes posterior
samples <- sample(x = p_grid, size = 1e3, prob = posterior, replace = TRUE) # sampling
hist(samples, main = "", xlab = expression(theta) ) # histogram
Analytical solution for the Beta-Binomial model (figure).
Grid method (figure).
Estimation of the central tendency: from a set of samples from a posterior distribution, we can calculate the mean, mode, and median. For example, for a uniform prior, 10 coin tosses, and 3 heads.
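A minimal sketch of these summaries, reusing the grid method from the previous slide with \(y = 3\) heads out of \(n = 10\) tosses and a uniform prior:
p_grid <- seq(from = 0, to = 1, length.out = 1000) # grid of candidate values
posterior <- dbinom(x = 3, size = 10, prob = p_grid) # likelihood times uniform prior
posterior <- posterior / sum(posterior) # normalising
samples <- sample(x = p_grid, size = 1e4, prob = posterior, replace = TRUE)
mean(samples) # posterior mean
median(samples) # posterior median
p_grid[which.max(posterior)] # posterior mode, close to 3/10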
What is the probability that the bias of the coin \(\theta\) is greater than 0.5?
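With posterior samples in hand, this kind of question reduces to counting samples (continuing the sketch above):
mean(samples > 0.5) # proportion of posterior samples above 0.5
sum(posterior[p_grid > 0.5]) # same quantity, computed directly on the grid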
Properties of the highest density interval (HDI): every value inside the interval has a higher probability density than any value outside it, and the interval is the narrowest one containing the requested probability mass.
Definition: the values of the parameter \(\theta\) contained in an 89% HDI are such that \(p(\theta) > W\), where \(W\) satisfies the following condition:
\[\int_{\theta \ : \ p(\theta) > W} p(\theta) \, \mathrm{d} \theta = 0.89.\]
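A minimal HDI-from-samples sketch (a hypothetical helper, not the imsb implementation), based on finding the narrowest interval containing the requested mass:
hdi <- function(samples, credmass = 0.89) {
  sorted <- sort(samples) # sorted posterior samples
  n <- length(sorted)
  width <- floor(credmass * n) # number of samples the interval must span
  candidates <- sorted[(width + 1):n] - sorted[1:(n - width)] # candidate interval widths
  lower <- which.min(candidates) # index of the narrowest candidate interval
  c(sorted[lower], sorted[lower + width])
}
hdi(samples) # 89% HDI from the posterior samples of the earlier sketch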
This procedure can be used to accept or reject a null value. The region of practical equivalence (ROPE) defines an interval of values considered “practically equivalent” to the null value. The figure below summarises the possible decisions resulting from this procedure (Kruschke, 2018).
The value of the parameter (e.g., \(\theta = 0.5\)) is rejected if the HDI is entirely outside the ROPE. The parameter value (e.g., \(\theta = 0.5\)) is accepted if the HDI is entirely within the ROPE. If the HDI and the ROPE overlap, we cannot conclude…
We tossed a coin 200 times and got 115 Heads. Is the coin biased? We build two models and try to find out which one best accounts for the data.
\[\begin{eqnarray*} \left\{ \begin{array}{ll} \mathcal{M}_0: Y \sim \mathrm{Binomial}(n, \theta = 0.5) & \quad \text{The coin is not biased}\\ \mathcal{M}_1: Y \sim \mathrm{Binomial}(n, \theta \neq 0.5) & \quad \text{The coin is biased} \end{array}\right. \end{eqnarray*}\]
The Bayes factor is the ratio of the marginal likelihoods of the two models.
\[\frac{p(\mathcal{M}_{0} \given \text{data})}{p(\mathcal{M}_{1} \given \text{data})} = \color{green}{\frac{p(\text{data} \given \mathcal{M}_{0})}{p(\text{data} \given \mathcal{M}_{1})}} \times \frac{p(\mathcal{M}_{0})}{p(\mathcal{M}_{1})}\]
In our example:
\[\text{BF}_{01} = \color{green}{\frac{p(\text{data} \given \mathcal{M}_{0})}{p(\text{data} \given \mathcal{M}_{1})}} = \frac{0.005955}{0.004975} \approx 1.1971.\]
This BF indicates that, after observing the data, the prior odds should be updated by a factor of about 1.2 (i.e., increased by about 20%) in favour of \(\mathcal{M}_{0}\). The Bayes factor can also be interpreted as follows: the data are approximately 1.2 times more likely under \(\mathcal{M}_{0}\) than under \(\mathcal{M}_{1}\).
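These two marginal likelihoods can be computed in a few lines of R (a sketch assuming a uniform prior on \(\theta\) under \(\mathcal{M}_{1}\), which yields the 0.004975 above):
# marginal likelihood under M0: theta fixed at 0.5
m0 <- dbinom(x = 115, size = 200, prob = 0.5) # approximately 0.005955
# marginal likelihood under M1: averaging the likelihood over a uniform prior
m1 <- integrate(f = function(theta) dbinom(x = 115, size = 200, prob = theta),
                lower = 0, upper = 1)$value # approximately 0.004975 (i.e., 1 / 201)
m0 / m1 # BF01, approximately 1.1971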
Two roles of the likelihood function: for fixed data, it quantifies the relative plausibility of different values of \(\theta\); for a fixed value of \(\theta\), it defines a probability distribution over possible observations.
This probability distribution can be used to generate data!
In a Bayesian model, there are two sources of uncertainty when generating predictions: uncertainty about the value of \(\theta\) (captured by the posterior distribution) and the sampling variability of the data given \(\theta\) (captured by the likelihood function).
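For instance, a minimal sketch propagating both sources of uncertainty, reusing the Beta(8, 4) posterior from the exam example above:
theta_samples <- rbeta(n = 1e4, shape1 = 8, shape2 = 4) # uncertainty about theta
y_pred <- rbinom(n = 1e4, size = 10, prob = theta_samples) # sampling variability given theta
table(y_pred) / 1e4 # posterior predictive distribution of the number of successes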
An analyst who works in a factory that makes famous Swedish bread rolls read a book that raised a thorny question: why does the toast always land on the butter side? Failing to come up with a plausible answer, she set out to verify this assertion. Her first experiment was to drop a slice of buttered bread from the height of a table. The results of this experiment are available in the tartine1 dataset from the imsb package.
First task: Retrieving the data.
Since the toast only has two sides, the result is similar to a draw according to a binomial distribution with an unknown parameter \(\theta\). What is the posterior distribution of the parameter \(\theta\) given these data and given that the analyst had no prior knowledge (you can also use your own prior)?
Calculate the 95% HDI of the posterior distribution and give a graphical representation of the result (hint: use the function imsb::posterior_plot()).
Can the null hypothesis that \(\theta = 0.5\) be rejected? Answer this question using the HDI+ROPE procedure.
Import the tartine2 data from the imsb package. Update the model using the mode of the posterior distribution calculated previously.
Since the toast only has two sides, the result is similar to a draw according to a binomial distribution with an unknown parameter \(\theta\). What is the posterior distribution of the parameter \(\theta\) given these data?
# number of trials
nbTrial <- length(data$trial)
# number of "successes" (i.e., when the toast lands on the butter side)
nbSuccess <- sum(data$side)
# size of the grid
grid_size <- 1e3
# generating the grid
p_grid <- seq(from = 0, to = 1, length.out = grid_size)
# uniform prior
prior <- rep(1, grid_size)
# computing the likelihood
likelihood <- dbinom(x = nbSuccess, size = nbTrial, prob = p_grid)
# computing the posterior
posterior <- likelihood * prior / sum(likelihood * prior)
Calculate the 95% HDI of the posterior distribution and give a graphical representation of the result.
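A minimal sketch of the missing sampling step (assuming the grid and posterior computed above, and the posterior_plot() call shown in the final answer below):
# sampling from the posterior, then plotting the 95% HDI
samples <- sample(x = p_grid, size = 1e4, prob = posterior, replace = TRUE)
posterior_plot(samples = samples, credmass = 0.95, compval = 0.5, rope = c(0.49, 0.51) ) +
  labs(x = expression(theta) )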
Can the null hypothesis that \(\theta = 0.5\) be rejected? No, because the HDI overlaps with the ROPE…
At this point, no conclusion can be drawn. The analyst decides to repeat a series of observations to refine her results.
We use the previous posterior as the prior for this new model.
mode1 <- find_mode(samples) # mode of the previous posterior
# new prior: Beta distribution reparametrised in terms of the mode (omega = mode1)
# and the certainty (kappa = nbTrial, the number of previous observations)
prior2 <- dbeta(p_grid, mode1 * (nbTrial - 2) + 1, (1 - mode1) * (nbTrial - 2) + 1)
data.frame(x = p_grid, y = prior2) %>%
ggplot(aes(x = x, y = y) ) +
geom_area(alpha = 0.8, fill = "steelblue") +
geom_line(size = 0.8) +
labs(x = expression(theta), y = "Probability density")
# likelihood of the new data (nbSuccess2 successes out of nbTrial2 trials, from tartine2)
likelihood2 <- dbinom(x = nbSuccess2, size = nbTrial2, prob = p_grid)
posterior2 <- likelihood2 * prior2 / sum(likelihood2 * prior2)
samples2 <- sample(p_grid, prob = posterior2, size = 1e4, replace = TRUE)
posterior_plot(
samples = samples2, credmass = 0.95,
compval = 0.5, rope = c(0.49, 0.51)
) + labs(x = expression(theta) )