A course with R, Stan, and brms
Course n°01: Introduction to Bayesian inference, Beta-Binomial model
Course n°02: Introduction to brms, linear regression
Course n°03: Markov Chain Monte Carlo, generalised linear model
Course n°04: Multilevel models, cognitive models
The aim is to build a model that can learn at several levels: a model that produces estimates informed by the different groups present in the data. We will use the following example throughout this course.
Let’s assume that we’ve built a robot that visits cafés and measures the waiting time after ordering a coffee. This robot visits 20 different cafés, 5 times in the morning and 5 times in the afternoon, and measures the time (in minutes) it takes to get a coffee.
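These data were simulated; below is a minimal sketch of how such data could be generated (the data frame name df and the parameter values are illustrative assumptions, not the ones actually used in the course).

library(tidyverse)

set.seed(666) # for reproducibility

n_cafes <- 20 # number of cafés
n_visits <- 10 # number of visits per café (5 mornings, 5 afternoons)

a_cafe <- rnorm(n_cafes, mean = 3.5, sd = 1) # café-specific average waiting times
b_afternoon <- -1 # average afternoon effect (in minutes)
sigma <- 0.5 # residual variability

df <- crossing(cafe = 1:n_cafes, visit = 1:n_visits) %>%
    # alternating morning (0) / afternoon (1) visits
    mutate(afternoon = (visit - 1) %% 2) %>%
    # simulating the waiting times
    mutate(wait = rnorm(n(), mean = a_cafe[cafe] + b_afternoon * afternoon, sd = sigma) ) %>%
    select(cafe, afternoon, wait)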
cafe afternoon wait
1 1 0 4.9989926
2 1 1 2.2133944
3 1 0 4.1866730
4 1 1 3.5624399
5 1 0 3.9956779
6 1 1 2.8957176
7 1 0 3.7804582
8 1 1 2.3844837
9 1 0 3.8617982
10 1 1 2.5800004
11 2 0 2.7421223
12 2 1 1.3525907
13 2 0 2.5215095
14 2 1 0.9628102
15 2 0 1.9543977
A first model estimates the average waiting time (across all cafés combined).
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha} \\ \color{steelblue}{\alpha} \ &\color{steelblue}{\sim \mathrm{Normal}(5, 10)} \\ \color{steelblue}{\sigma} \ &\color{steelblue}{\sim \mathrm{HalfCauchy}(0, 2)} \\ \end{align} \]
As a reminder, the Cauchy distribution with location \(x_{0}\) and scale \(\gamma\) (used above, in its half form, as a prior for \(\sigma\)) has density:
\[ p(x \mid x_{0}, \gamma) = \left(\pi \gamma \left[1 + \left(\frac{x-x_{0}}{\gamma}\right)^{2}\right] \right)^{-1} \]
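This first model could be fitted with brms; a minimal sketch, assuming the data frame is named df (the name mod1 matches the model comparisons later in the course):

library(brms)

mod1 <- brm(
    # intercept-only model: a single average waiting time
    wait ~ 1,
    data = df,
    prior = c(
        prior(normal(5, 10), class = Intercept),
        prior(cauchy(0, 2), class = sigma) # half-Cauchy, as sigma is bounded at zero
        )
    )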
A second model estimates one intercept per café, which is equivalent to constructing 20 dummy variables.
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]}} \\ \color{steelblue}{\alpha_{\text{café}}} \ &\color{steelblue}{\sim \mathrm{Normal}(5, 10)} \\ \color{steelblue}{\sigma} \ &\color{steelblue}{\sim \mathrm{HalfCauchy}(0, 2)} \\ \end{align} \]
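A plausible brms specification, given the coefficient names shown below (b_factorcafe1, …): the common intercept is removed so that each café gets its own intercept. This is a sketch, not necessarily the exact call used in the course.

mod2 <- brm(
    # removing the common intercept: one intercept per café
    wait ~ 0 + factor(cafe),
    data = df,
    prior = prior(normal(5, 10), class = b)
    )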
Model 2 (one intercept per café):
Estimate Est.Error Q2.5 Q97.5
b_factorcafe1 3.451085 0.2610853 2.9512768 3.974664
b_factorcafe2 1.724707 0.2618685 1.2119623 2.232001
b_factorcafe3 3.321328 0.2591069 2.8128314 3.818617
b_factorcafe4 2.795010 0.2491789 2.3113631 3.271451
b_factorcafe5 1.461022 0.2614662 0.9426324 1.981485
b_factorcafe6 3.637604 0.2580897 3.1300960 4.135650
b_factorcafe7 2.943423 0.2566098 2.4467521 3.447219
b_factorcafe8 3.167672 0.2568818 2.6596383 3.673230
b_factorcafe9 3.334906 0.2577492 2.8215052 3.834208
b_factorcafe10 3.096252 0.2613306 2.5830027 3.594380
b_factorcafe11 1.919792 0.2688887 1.4079115 2.432918
b_factorcafe12 3.487732 0.2518169 2.9864178 3.968451
b_factorcafe13 3.219461 0.2626960 2.7083647 3.729026
b_factorcafe14 2.630692 0.2653149 2.0847316 3.159034
b_factorcafe15 3.476996 0.2600783 2.9567007 3.996322
b_factorcafe16 3.005773 0.2583324 2.5112834 3.514806
b_factorcafe17 3.879522 0.2627024 3.3518362 4.399417
b_factorcafe18 5.527586 0.2586780 5.0194691 6.029890
b_factorcafe19 2.972815 0.2567997 2.4740427 3.487078
b_factorcafe20 3.364395 0.2704167 2.8278114 3.897341
Couldn’t we let the waiting times measured at café 1 inform the estimates for cafés 2 and 3, as well as the estimate of the average time needed to be served? To do so, we are going to learn the priors from the data…
\[ \begin{align} \text{Level 1}: \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]}} \\ \text{Level 2}: \color{steelblue}{\alpha_{\text{café}}} \ &\color{steelblue}{\sim \mathrm{Normal}(\alpha,\sigma_{\text{café}})} \\ \color{steelblue}{\alpha} \ &\color{steelblue}{\sim \mathrm{Normal}(5, 10)} \\ \color{steelblue}{\sigma_{\text{café}}} \ &\color{steelblue}{\sim \mathrm{HalfCauchy}(0, 2)} \\ \color{steelblue}{\sigma} \ &\color{steelblue}{\sim \mathrm{HalfCauchy}(0, 2)} \\ \end{align} \]
The prior for the intercept of each café (\(\alpha_{\text{café}}\)) is now a function of two parameters (\(\alpha\) and \(\sigma_{\text{café}}\)). \(\alpha\) and \(\sigma_{\text{café}}\) are called hyper-parameters: they are parameters for parameters, and their priors are called hyperpriors. There are two levels in the model…
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]}} \\ \color{steelblue}{\alpha_{\text{café}}} \ &\color{steelblue}{\sim \mathrm{Normal}(\alpha,\sigma_{\text{café}})} \\ \end{align} \]
NB: \(\alpha\) is defined here in the prior for \(\alpha_{\text{café}}\) but it could also be defined in the linear model:
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha + \alpha_{\text{café}[i]}} \\ \color{steelblue}{\alpha_{\text{café}}} \ &\color{steelblue}{\sim \mathrm{Normal}(0,\sigma_{\text{café}})} \\ \end{align} \]
We can always “remove” the mean from a Gaussian distribution and consider it as a constant plus a Gaussian centred on zero.
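In other words:
\[ \alpha_{\text{café}} \sim \mathrm{Normal}(\alpha, \sigma_{\text{café}}) \iff \alpha_{\text{café}} = \alpha + z_{\text{café}}, \quad z_{\text{café}} \sim \mathrm{Normal}(0, \sigma_{\text{café}}) \]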
NB: when \(\alpha\) is defined in the linear model, the \(\alpha_{\text{café}}\) represent deviations from the mean intercept. It is therefore necessary to add \(\alpha\) and \(\alpha_{\text{café}}\) to obtain the average waiting time per café…
This model has 23 parameters: the general intercept \(\alpha\), the residual variability \(\sigma\), the between-café variability \(\sigma_{\text{café}}\), and one intercept per café (20 intercepts).
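In brms, this multilevel model corresponds to a varying intercept by café; a sketch (the name mod3 matches the comparisons below):

mod3 <- brm(
    # varying (partially pooled) intercept per café
    wait ~ 1 + (1 | cafe),
    data = df,
    prior = c(
        prior(normal(5, 10), class = Intercept),
        prior(cauchy(0, 2), class = sd), # prior on sigma_café
        prior(cauchy(0, 2), class = sigma) # prior on the residual sigma
        )
    )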
The James-Stein estimator is defined as \(z = \bar{y} + c(y - \bar{y})\), where \(\bar{y}\) is the sample mean, \(y\) is an individual observation, and \(c\) is a constant, the shrinking factor (Efron & Morris, 1977).
The shrinking factor is determined both by the variability (imprecision) of the measurement (e.g., its standard deviation) and by the distance to the mean estimate (i.e., \(y - \bar{y}\)). In other words, this estimator is less “confident” about (i.e., gives less weight to) imprecise and/or extreme observations. In practice, shrinkage acts as a safeguard against overlearning (overfitting).
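A minimal numerical illustration in R, with made-up values:

y <- c(2.1, 3.5, 5.2, 1.8) # hypothetical individual observations
c_shrink <- 0.5 # illustrative shrinking factor
z <- mean(y) + c_shrink * (y - mean(y) ) # shrunk estimates
# the most extreme observations are pulled the most towards the sample mean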
The shrinkage observed on the previous slide is due to information pooling between cafés. The estimate of the intercept for each café informs the intercept estimates of the other cafés, as well as the estimate of the general intercept (i.e., the overall average waiting time).
There are generally three perspectives (or strategies):
Complete pooling: the waiting time is assumed to be invariant across cafés; a common intercept is estimated (mod1).
No pooling: each café’s waiting time is assumed to be unique and independent; an intercept is estimated for each café, but without informing the higher level (mod2).
Partial pooling: an adaptive prior is used, as in the previous example (mod3).
The complete pooling strategy generally underfits the data (low predictive capacity) whereas the no pooling strategy amounts to overfitting the data (low predictive capacity here too). The partial pooling strategy (i.e., that of multilevel models) balances underfitting and overfitting.
We can compare these models using indices derived from information theory (extensions of AIC), such as the WAIC (the lower the better).
# computing the WAIC for each model and storing it
mod1 <- add_criterion(mod1, "waic")
mod2 <- add_criterion(mod2, "waic")
mod3 <- add_criterion(mod3, "waic")
# comparing these WAICs
w <- loo_compare(mod1, mod2, mod3, criterion = "waic")
print(w, simplify = FALSE)
elpd_diff se_diff elpd_waic se_elpd_waic p_waic se_p_waic waic se_waic
mod3 0.0 0.0 -253.8 8.3 18.2 1.5 507.6 16.6
mod2 -0.7 1.3 -254.5 8.4 19.6 1.6 509.1 16.8
mod1 -57.3 10.6 -311.1 10.5 2.0 0.3 622.2 21.1
We note that model 3 has only about 18 effective parameters (pWAIC), fewer than model 2, whereas it actually has 2 more… posterior_summary(mod3)[3, 1] gives us the standard deviation of the adaptive prior on \(\alpha_{\text{café}}\) (\(\sigma_{\text{café}} = 0.82\)). Note that this sigma is rather low: it amounts to assigning a very restrictive, regularising prior.
We compare the estimates from the first (complete pooling) and third (partial pooling) model.
Model 1 (complete pooling):
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.118251 0.08020407 2.956309 3.268246
sigma 1.143048 0.05675120 1.035605 1.259202
Model 3 (partial pooling):
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.1226606 0.2075211 2.7209592 3.5347898
sigma 0.8221927 0.0440647 0.7409711 0.9137384
Both models make the same prediction (on average) for \(\alpha\), but model 3 is more uncertain of its prediction than model 1 (see the standard error for \(\alpha\))…
The \(\sigma\) estimate of model 3 is smaller than that of model 1 because model 3 decomposes the unexplained variability into two sources: variability in waiting time between cafés and the residual variability \(\sigma\).
Let’s assume that our robot doesn’t visit all the cafés the same number of times (as in the previous case) but that it visits more often the cafés close to home…
We can see that cafés that are visited frequently (right) are less affected by the effect of shrinkage. Their estimates are less “pulled” towards the average than the estimates of the least frequently visited cafés (left).
There are five (contradictory) definitions of fixed and random effects, as identified by Gelman (2005).
Gelman & Hill (2006) suggest instead the use of the terms constant effects and varying effects, and to always use multilevel modelling, considering that a so-called fixed effect can simply be considered as a random effect whose variance would be equal to \(0\) (see also Nalborczyk et al., 2019).
Varying the intercepts for each café is simply another way of (adaptively) regularising, that is, reducing the weight given to the data in the estimation. The model becomes able to estimate the extent to which the groups (in this case the cafés) are different, while estimating the characteristics of each café…
Note the difference between cross-classified (or “crossed”) multilevel models and nested (or hierarchical) multilevel models. Cross-classified models refer to data structured according to two (or more) non-nested random factors, whereas hierarchical models usually refer to hierarchically structured data (e.g., a student in a class in a school in a city…). See this discussion for more details.
However, the two types of models are written in a similar way, on several “levels”. The term “multilevel” (in our terminology) therefore refers to the structure of the model, to its specification. It is distinct from the structure of the data.
We are now interested in the effect of the time of day on the waiting time. Do we wait more in the morning, or in the afternoon?
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]} + \beta_{\text{café}[i]} \times A_{i}} \\ \end{align} \]
Where \(A_{i}\) is a dummy variable coded 0/1 for morning/afternoon and where \(\beta_{\text{café}}\) is therefore a difference parameter (i.e., a slope) between morning and afternoon.
Note: we know that cafés have intercepts and slopes that co-vary… Popular cafés will be overcrowded in the morning and much less in the afternoon, resulting in a negative slope. These cafés will also have a longer average waiting time (i.e., a larger intercept). In these cafés, \(\alpha\) is large and \(\beta\) is far from zero. Conversely, in an unpopular café, the waiting time will be short, as well as the difference between the morning and afternoon’s waiting time.
We could therefore use the co-variation between intercepts and slopes to make better inferences. In other words, we can let the estimate of the intercept inform the estimate of the slope, and vice versa.
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]} + \beta_{\text{café}[i]} \times A_{i}} \\ \color{steelblue}{\begin{bmatrix} \alpha_{\text{café}} \\ \beta_{\text{café}} \\ \end{bmatrix}} \ &\color{steelblue}{\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \textbf{S}\bigg)} \\ \end{align} \]
The third line posits that every café has an intercept \(\alpha_{\text{café}}\) and a slope \(\beta_{\text{café}}\), defined by a bivariate (i.e., two-dimensional) Gaussian prior having as means \(\alpha\) and \(\beta\) and as covariance matrix \(\textbf{S}\).
\[\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\]
Where \(\boldsymbol{\mu}\) is a (\(k\)-dimensional) vector of means, for instance: mu <- c(a, b). \(\boldsymbol{\Sigma}\) is a \(k \times k\) covariance matrix, which corresponds to the matrix given by the function vcov().
\[ \begin{align} \boldsymbol{\Sigma} &= \begin{pmatrix} \sigma_{\alpha}^{2} & \sigma_{\alpha} \sigma_{\beta} \rho \\ \sigma_{\alpha} \sigma_{\beta} \rho & \sigma_{\beta}^{2} \\ \end{pmatrix} \\ \end{align} \]
This matrix can be constructed in two different, strictly equivalent ways: either directly, as above, or by factoring it into standard deviations and a correlation matrix \(\textbf{R}\):
\[ \begin{align} \boldsymbol{\Sigma} &= \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix} \ \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix} \\ \end{align} \]
The second method is convenient because it considers the standard deviations and the correlations separately.
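In R, the two constructions look like this; a sketch with illustrative values for \(\sigma_{\alpha}\), \(\sigma_{\beta}\), and \(\rho\):

sigma_a <- 1 # sd of the intercepts (illustrative value)
sigma_b <- 0.5 # sd of the slopes (illustrative value)
rho <- -0.7 # correlation between intercepts and slopes (illustrative value)

# first method: building the covariance matrix directly
cov_ab <- sigma_a * sigma_b * rho
Sigma1 <- matrix(c(sigma_a^2, cov_ab, cov_ab, sigma_b^2), ncol = 2)

# second method: factoring standard deviations and correlation matrix
sigmas <- c(sigma_a, sigma_b)
Rho <- matrix(c(1, rho, rho, 1), ncol = 2)
Sigma2 <- diag(sigmas) %*% Rho %*% diag(sigmas)

all.equal(Sigma1, Sigma2) # TRUE, the two methods are strictly equivalent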
\[ \begin{align} \color{orangered}{w_{i}} \ &\color{orangered}{\sim \mathrm{Normal}(\mu_{i}, \sigma)} \\ \color{black}{\mu_{i}} \ &\color{black}{= \alpha_{\text{café}[i]} + \beta_{\text{café}[i]} \times A_{i}} \\ \color{steelblue}{\begin{bmatrix} \alpha_{\text{café}} \\ \beta_{\text{café}} \\ \end{bmatrix}} \ &\color{steelblue}{\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \textbf{S}\bigg)} \\ \color{black}{\textbf{S}} \ &\color{black}{= \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix} \ \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix}} \\ \color{steelblue}{\alpha} \ &\color{steelblue}{\sim \mathrm{Normal} (0, 10)} \\ \color{steelblue}{\beta} \ &\color{steelblue}{\sim \mathrm{Normal} (0, 10)} \\ \color{steelblue}{\sigma_{\alpha}} \ &\color{steelblue}{\sim \mathrm{HalfCauchy} (0, 2)} \\ \color{steelblue}{\sigma_{\beta}} \ &\color{steelblue}{\sim \mathrm{HalfCauchy} (0, 2)} \\ \color{steelblue}{\sigma} \ &\color{steelblue}{\sim \mathrm{HalfCauchy} (0, 2)} \\ \color{steelblue}{\textbf{R}} \ &\color{steelblue}{\sim \mathrm{LKJ}(2)} \\ \end{align} \]
\(\textbf{S}\) is defined by factoring \(\sigma_{\alpha}\), \(\sigma_{\beta}\), and the correlation matrix \(\textbf{R}\). The next lines of the model simply define priors for constant effects. The last line specifies the prior for \(\textbf{R}\).
Prior proposed by Lewandowski et al. (2009). A single parameter \(\zeta\) (zeta; called eta in the rethinking and brms packages) specifies the concentration of the distribution of the correlation coefficients. The \(\mathrm{LKJ}(2)\) prior defines a weakly informative prior for \(\rho\) (rho), sceptical of extreme correlations (i.e., values close to \(-1\) or \(1\)).
The brms package uses the same syntax as base R functions (like lm) or the lme4 package.
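For instance, using the sleepstudy data analysed later in this course, the two formulas discussed below could be written as follows (a sketch):

# varying intercept by Subject
Reaction ~ Days + (1 | Subject)
# varying intercept and varying slope for Days, by Subject
Reaction ~ Days + (1 + Days | Subject)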
The left-hand side defines the dependent variable (or “outcome”, i.e., what we are trying to predict).
The first part of the right-hand side of the formula represents the constant effects (fixed effects), whereas the second part (between parentheses) represents varying effects (random effects).
The first model above contains only a varying intercept, which varies by Subject. The second model contains a varying intercept, but also a varying slope for the effect of Days.
When including several varying effects (e.g., an intercept and a slope), brms assumes that we also want to estimate the correlation between these effects. Otherwise, we can remove this correlation (i.e., set it to 0) using ||.
We specify an intercept and a slope (for the afternoon effect) which vary by cafe.
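A sketch of this model in brms (priors omitted for brevity; the name mod5 matches the comparisons below):

mod5 <- brm(
    # varying intercept and varying slope of afternoon, by café
    wait ~ 1 + afternoon + (1 + afternoon | cafe),
    data = df
    )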
post <- as_draws_df(x = mod5) # extracting posterior samples
R <- rethinking::rlkjcorr(n = 16000, K = 2, eta = 2) # samples from prior
data.frame(prior = R[, 1, 2], posterior = post$cor_cafe__Intercept__afternoon) %>%
gather(type, value, prior:posterior) %>%
ggplot(aes(x = value, color = type, fill = type) ) +
geom_histogram(position = "identity", alpha = 0.2) +
labs(x = expression(rho), y = "Number of samples")
We compare the first model (complete pooling), the second and third models (no pooling and partial pooling), and the last model (with varying intercept and slope).
mod5 <- add_criterion(mod5, "waic")
w <- loo_compare(mod1, mod2, mod3, mod5, criterion = "waic")
print(w, simplify = FALSE)
elpd_diff se_diff elpd_waic se_elpd_waic p_waic se_p_waic waic se_waic
mod5 0.0 0.0 -155.1 10.0 26.5 2.6 310.2 20.1
mod3 -98.7 8.3 -253.8 8.3 18.2 1.5 507.6 16.6
mod2 -99.5 8.3 -254.5 8.4 19.6 1.6 509.1 16.8
mod1 -156.0 13.7 -311.1 10.5 2.0 0.3 622.2 21.1
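The model weights shown below can be derived from the WAICs; a sketch using brms::model_weights():

# computing WAIC-based model weights
model_weights(mod1, mod2, mod3, mod5, weights = "waic")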
mod1 mod2 mod3 mod5
1.763502e-68 6.397539e-44 1.337590e-43 1.000000e+00
The estimate of the average waiting time is more uncertain when we take into account new sources of error. However, the overall error of the model (i.e., what is not explained), the residual variability \(\sigma\), decreases…
Model 1 (complete pooling):
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.118251 0.08020407 2.956309 3.268246
sigma 1.143048 0.05675120 1.035605 1.259202
Model 3 (varying intercept):
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.1226606 0.2075211 2.7209592 3.5347898
sigma 0.8221927 0.0440647 0.7409711 0.9137384
Model 5 (varying intercept and slope):
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.740964 0.21779934 3.3144648 4.1799312
b_afternoon -1.232429 0.08622682 -1.4041576 -1.0647728
sigma 0.489634 0.02755458 0.4379875 0.5474008
Multilevel models (or “mixed-effects models”) are natural extensions of classical (single-level) regression models, where classical parameters are themselves assigned “models”, governed by hyper-parameters.
This extension makes it possible to make more precise predictions by taking into account the variability related to groups or structures (clusters) present in the data. In other words, by modelling the populations from which the varying effects are drawn (e.g., the population of participants or stimuli).
A single-level regression model is equivalent to a multilevel model where the variability of varying effects would be fixed at \(0\).
The Bayesian framework allows a natural interpretation of distributions from which the varying effects come. Indeed, these distributions can be interpreted as prior distributions, whose parameters are estimated from the data.
The sleepstudy data (from the lme4 package):
Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
7 382.2038 6 308
8 290.1486 7 308
9 430.5853 8 308
10 466.3535 9 308
11 222.7339 0 309
12 205.2658 1 309
13 202.9778 2 309
14 204.7070 3 309
15 207.7161 4 309
16 215.9618 5 309
17 213.6303 6 309
18 217.7272 7 309
19 224.2957 8 309
20 237.3142 9 309
It’s up to you to build the mathematical models and the corresponding brms models:
A model predicting Reaction with a constant effect of Days.
A model predicting Reaction with a constant effect of Days + a random effect of Subject (varying intercept).
A model predicting Reaction with a constant effect of Days + a random effect of Subject (varying intercept + varying slope for Days).
Then, compare these models using model comparison tools and conclude.
# frequentist (flat-priors) models
library(lme4) # provides lmer() and the sleepstudy data
fmod0 <- lm(Reaction ~ Days, sleepstudy)
fmod1 <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)
fmod2 <- lmer(Reaction ~ Days + (1 + Days | Subject), sleepstudy)
# comparing fmod1 and fmod2
anova(fmod1, fmod2)
Data: sleepstudy
Models:
fmod1: Reaction ~ Days + (1 | Subject)
fmod2: Reaction ~ Days + (1 + Days | Subject)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fmod1 4 1802.1 1814.8 -897.04 1794.1
fmod2 6 1763.9 1783.1 -875.97 1751.9 42.139 2 7.072e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
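A possible brms translation of these three models (a sketch, with default priors omitted for brevity; the names mod6 to mod8 match the comparison below):

mod6 <- brm(Reaction ~ Days, data = sleepstudy)
mod7 <- brm(Reaction ~ Days + (1 | Subject), data = sleepstudy)
mod8 <- brm(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)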
# computing and storing the WAIC of each model
mod6 <- add_criterion(mod6, "waic")
mod7 <- add_criterion(mod7, "waic")
mod8 <- add_criterion(mod8, "waic")
# comparing the WAICs of these models
w <- loo_compare(mod6, mod7, mod8, criterion = "waic")
print(w, simplify = FALSE)
elpd_diff se_diff elpd_waic se_elpd_waic p_waic se_p_waic waic se_waic
mod8 0.0 0.0 -860.0 22.2 32.5 8.1 1720.0 44.4
mod7 -24.7 11.5 -884.7 14.4 19.2 3.3 1769.4 28.8
mod6 -93.3 20.8 -953.3 10.6 3.2 0.5 1906.6 21.1
WAIC-based model weights:
mod6 mod7 mod8
3.025212e-41 1.838528e-11 1.000000e+00
One of the most basic problems in scientific inference is the so-called inverse problem: how to figure out causes from observations. It is a problem because many different causes can produce the same evidence. So while it can be easy to go forward from a known cause to predicted observations, it can be very hard to go backwards from observation to cause (McElreath, 2020).
So far, we have only considered statistical models. These models are useful devices to describe associations, but they tell us nothing about how these associations arise. In the last part of the course, we will focus on process models, which aim to describe the mechanisms generating the data (generative models).
Two-alternative forced choice (2AFC) is a method for measuring the sensitivity of a person or animal to some particular sensory input (stimulus), through that observer’s pattern of choices and response times to two versions of the sensory input. On each trial, the participant is forced to choose between two alternatives. For instance, in the random dot motion coherence task (below), the participant must choose between two directions of motion (e.g., up vs. down, or left vs. right), usually indicated by a motor response such as a saccade or a button press.
Reaction time (RT) distributions are generally positively skewed, with the skewness increasing with task difficulty. We also know that the mean of the RTs is proportional to their standard deviation. Increases in difficulty usually lead to increased RTs and decreased accuracy. Changes in difficulty also produce regular changes in the distribution of RTs, most notably in its spread but not much in its shape (for a review, see Forstmann et al., 2016). Moreover, we often find a speed-accuracy trade-off in these tasks.
The use of simple statistical models (e.g., only analysing differences in group-level average RTs across conditions) is severely limited in such tasks. Therefore, several models have been proposed to account for the peculiarities of the data coming from these tasks, as well as to relate them to the underlying cognitive processes.
There are typically three assumptions made by evidence accumulation models: (i) evidence in favour of each alternative is accumulated over time, (ii) the accumulation process is noisy, and (iii) a decision is made as soon as the accumulated evidence reaches a threshold.
The drift-diffusion model (DDM) is a continuous-time evidence accumulation model for binary choice tasks (Ratcliff, 1978). It assumes that in each trial evidence is accumulated in a noisy (diffusion) process by a single accumulator. As shown below, evidence accumulation starts at some point (the starting point or “bias”) and continues until the accumulator hits one of the two decision bounds in which case the corresponding response is given. The total response time is the sum of the decision time from the accumulation process plus non-decisional components (Vandekerckhove et al., 2010; Wabersich & Vandekerckhove, 2014; Wagenmakers, 2009). This kind of model provides a decomposition of RT data that isolates components (of processing) from stimulus encoding to decision so that they can be studied individually (Ratcliff & McKoon, 2008; Wagenmakers et al., 2007).
In sum, the original DDM allows decomposing responses to a binary choice task and the corresponding response times into four latent processes (from Singmann, 2017):
The drift rate \(\delta\) (delta) is the average slope of the accumulation process towards the boundaries (i.e., it represents the average amount of evidence accumulated per unit time). The larger the (absolute value of the) drift rate, the stronger the evidence for the corresponding response option (thus quantifying the “ease of processing”).
The boundary separation \(\alpha\) (alpha) is the distance between the two decision bounds and can be interpreted as a measure of response caution, with a high \(\alpha\) corresponding to high caution.
The starting point (or bias) \(\beta\) (beta) of the accumulation process is a measure of response bias towards one of the two response boundaries.
The non-decision time \(\tau\) (tau) captures all non-decisional processes such as stimulus encoding and (motor) response processes.
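To build intuition about these four parameters, we can simulate data from the model; a sketch using rtdists::rdiffusion() with illustrative parameter values:

library(rtdists)

# simulating 1000 trials from a drift-diffusion model
sim <- rdiffusion(
    n = 1000,
    a = 1, # boundary separation (response caution)
    v = 2, # drift rate (ease of processing)
    t0 = 0.3, # non-decision time (in seconds)
    z = 0.5 # starting point (here a/2, i.e., no bias)
    )

head(sim) # rt and response ("upper" or "lower") for each simulated trial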
The lexical decision task is a procedure used in many psychology and psycholinguistics experiments. The basic procedure involves measuring how quickly and accurately people classify stimuli as words or nonwords.
We will adapt the example from Singmann (2017) and analyse part of the data from Experiment 1 of Wagenmakers et al. (2008). The data comes from 17 participants performing a lexical decision task. Participants made decisions under speed or accuracy emphasis instructions in different experimental blocks. After removing some extreme RTs, we restrict the analysis to high-frequency words (frequency = high) and the corresponding high-frequency non-words (frequency = nw_high) to reduce estimation time. To set up the model, we also need a numeric response variable in which 0 corresponds to responses at the lower response boundary and 1 corresponds to responses at the upper boundary.
# loading the "speed_acc" data from the "rtdists" package
data(speed_acc, package = "rtdists")
# reshaping the data
df <- speed_acc %>%
# removing extreme RTs
filter(censor == FALSE) %>%
# removing ppt with id=2 (less observations than others)
filter(id != 2) %>%
# focusing on high-frequency words and non-words
filter(frequency %in% c("high", "nw_high") ) %>%
# converting the response variable to a numeric 0/1 variable
mutate(response2 = as.numeric(response == "word") ) %>%
# keeping only some proportion of the data (for computational ease)
filter(as.numeric(block) < 9) %>%
mutate(id = factor(id), block = factor(block) )
An important decision that has to be made before setting up a model is which parameters are allowed to differ between which conditions. One common constraint of the DDM is that parameters that are set before the evidence accumulation process starts (i.e., boundary separation, starting point, and non-decision time) cannot change based on stimulus characteristics that are not known to the participant before the start of the trial. Thus, the stimulus category, in the present case word versus non-word, is usually only allowed to affect the drift rate. We follow this constraint. Furthermore, all relevant variables are manipulated within-subject. Thus, the maximal varying-effects structure (Barr et al., 2013) can (and should) be implemented.
# defining the model formula (one "linear model" per parameter)
formula <- brmsformula(
# drift rate (delta)
rt | dec(response2) ~ 1 + condition * stim_cat + (1 + condition * stim_cat | id),
# boundary separation parameter (alpha)
bs ~ 1 + condition + (1 + condition | id),
# non-decision time (tau)
ndt ~ 1 + condition + (1 + condition | id),
# starting point or bias (beta)
bias ~ 1 + condition + (1 + condition | id)
)
# defining the contrasts
contrasts(df$condition) <- c(+0.5, -0.5)
contrasts(df$stim_cat) <- c(+0.5, -0.5)
# defining the priors
priors <- c(
# priors for the intercepts
prior(normal(0, 5), class = "Intercept"),
prior(normal(0, 1), class = "Intercept", dpar = "bs"),
prior(normal(0, 1), class = "Intercept", dpar = "ndt"),
prior(normal(0, 1), class = "Intercept", dpar = "bias"),
# priors for the slopes
prior(normal(0, 1), class = "b"),
# priors on the SD of the varying effects
prior(exponential(1), class = "sd")
)
We then fit this model using the brms::brm() function. We run 8 chains of 5000 iterations each and use the first 1000 iterations as warmup, resulting in a total of \(8 \times (5000 - 1000) = 32000\) posterior samples.
# specify initial values to help the model start sampling
# (with small variation between chains)
chains <- 8 # number of chains
epsilon <- 0.1 # variability in starting value for the NDT intercept
get_init_value <- function (x) list(Intercept_ndt = rnorm(n = 1, mean = x, sd = epsilon) )
inits_drift <- replicate(chains, get_init_value(-3), simplify = FALSE)
# fitting the model
fit_wiener <- brm(
formula = formula,
data = df,
# specifying the family and link functions for each parameter
family = wiener(
link = "identity", link_bs = "log",
link_ndt = "log", link_bias = "logit"
),
# comment this line to use default priors
prior = priors,
# list of initialisation values
init = inits_drift,
init_r = 0.05,
warmup = 1000, iter = 5000,
chains = chains, cores = chains,
# control = list(adapt_delta = 0.99, max_treedepth = 15),
# saves the model (as .rds) or loads it if it already exists
file = "models/ddm.rds",
# needed for hypothesis testing
sample_prior = TRUE
)
Our model can be written (in a simplified form, omitting the varying effects) as:
\[ \begin{aligned} \text{RT}_{i} &\sim \mathrm{DDM}(\alpha_{i}, \tau_{i}, \beta_{i}, \delta_{i}) && \mbox{Observation model for the RTs.} \\ \delta_{i} &= \beta_{0[\delta]} + \beta_{1[\delta]} \cdot \text{Condition}_{i} + \beta_{2[\delta]} \cdot \text{Stim_cat}_{i} + \ && \mbox{Linear model for the drift rate.} \\ & \ \ \ \ \ \beta_{3[\delta]} \cdot \text{Condition}_{i} \cdot \text{Stim_cat}_{i} \\ \log(\alpha_{i}) &= \beta_{0[\alpha]} + \beta_{1[\alpha]} \cdot \text{Condition}_{i} && \mbox{Linear model for the (log) boundary separation.} \\ \log(\tau_{i}) &= \beta_{0[\tau]} + \beta_{1[\tau]} \cdot \text{Condition}_{i} && \mbox{Linear model for the (log) non-decision time.} \\ \mathrm{logit}(\beta_{i}) &= \beta_{0[\beta]} + \beta_{1[\beta]} \cdot \text{Condition}_{i} && \mbox{Linear model for the (logit) bias.} \\ \beta_{0[\delta]} &\sim \mathrm{Normal}(0, 5) && \mbox {Prior on the intercept for the drift rate.} \\ \beta_{1[\delta]}, \beta_{2[\delta]}, \beta_{3[\delta]} &\sim \mathrm{Normal}(0, 1) && \mbox {Prior on the slopes for the drift rate.} \\ \beta_{0[\alpha]}, \beta_{0[\tau]}, \beta_{0[\beta]} &\sim \mathrm{Normal}(0, 1) && \mbox {Prior on the intercept for the other parameters.} \\ \beta_{1[\alpha]}, \beta_{1[\tau]}, \beta_{1[\beta]} &\sim \mathrm{Normal}(0, 1) && \mbox {Prior on the slopes for the other parameters.} \\ \end{aligned} \]
where \(i\) denotes observations (i.e., lines in the dataframe).
A powerful way to convey the relationship between response times and accuracy is using quantile probability plots (Ratcliff & Tuerlinckx, 2002) which show quantiles of the response times distribution (typically 0.1, 0.3, 0.5, 0.7, and 0.9) for correct and incorrect responses on the y-axis against probabilities of correct and incorrect responses for experimental conditions on the x-axis. The plot is built by first aggregating the data (cf. the detailed code online).
# aggregating the data using the qpf() function from
# https://vasishth.github.io/bayescogsci/book/ch-lognormalrace.html#sec-acccoding
df_qpf <- df %>%
mutate(acc = ifelse(as.character(stim_cat) == as.character(response), 1, 0) ) %>%
group_by(stim_cat, condition) %>%
qpf() %>%
ungroup()
head(df_qpf)
# A tibble: 6 × 6
stim_cat condition rt_q p q response
<fct> <fct> <dbl> <dbl> <dbl> <chr>
1 word accuracy 0.366 0.0147 0.1 incorrect
2 word accuracy 0.48 0.0147 0.3 incorrect
3 word accuracy 0.504 0.0147 0.5 incorrect
4 word accuracy 0.533 0.0147 0.7 incorrect
5 word accuracy 0.786 0.0147 0.9 incorrect
6 word accuracy 0.449 0.985 0.1 correct
This plot shows that words are recognised faster than non-words, that responses are generally faster in the “speed” than in the “accuracy” condition, and that incorrect responses seem more variable than correct responses.
The model fit is not so bad, but the model is unable to capture fast errors (bottom left), and more generally, extreme quantiles…
We first check whether there is a difference in drift rate between conditions for words and non-words. A non-negligible part of the posterior mass is above zero, meaning there is some (weak) evidence that the drift rate is greater in the accuracy than in the speed condition.
# note: drift_rate_samples_per_condition and posterior_plot() are
# assumed to be defined upstream (cf. the detailed code online)
samps <- drift_rate_samples_per_condition %>%
mutate(.value = if_else(stim_cat == "nonword", (-1) * .value, .value) ) %>%
pivot_wider(names_from = condition, values_from = .value) %>%
mutate(accuracy_speed_diff = accuracy - speed)
posterior_plot(
samples = sample(x = samps$accuracy_speed_diff, size = 1e3),
compval = 0, nbins = 30
) + labs(x = "Difference in drift rate (accuracy - speed)")
Recall that the boundary separation parameter can be interpreted as a measure of response caution (with high \(\alpha\) corresponding to high response caution), and that the linear model for this parameter is on the log scale (i.e., we used a log link function): \(\log(\alpha_{i}) = \beta_{0} + \beta_{1} \cdot \text{Condition}_{i}\). Therefore, we have to apply the inverse link function (i.e., \(\exp(\cdot)\)) to the parameter to be able to interpret it. Taking \(\exp(\beta_{1})\) gives the proportional change in the value of the boundary-separation parameter when we go from the speed to the accuracy condition (see upper right panel). In our case, \(\exp(\beta_{1}) \approx 1.4\), which means that going from the speed to the accuracy condition leads to an increase of approximately 40% in the value of the boundary-separation parameter. In other words, response caution is higher in the accuracy (lower right panel) than in the speed (lower left panel) condition.
# retrieving posterior samples
post <- as_draws_df(x = fit_wiener)
# retrieving the posterior samples for the boundary-separation
posterior_intercept_bs <- post$b_bs_Intercept
posterior_slope_bs <- post$b_bs_condition1
# computing the posterior distribution in the speed condition
posterior_bs_speed <- exp(posterior_intercept_bs - 0.5 * posterior_slope_bs)
# computing the posterior distribution in the accuracy condition
posterior_bs_accuracy <- exp(posterior_intercept_bs + 0.5 * posterior_slope_bs)
Recall that the non-decision time parameter can be interpreted as a measure of the time used by non-decisional processes such as stimulus encoding or motor response, and that the linear model for this parameter is on the log scale (i.e., we used a log link function): \(\log(\tau_{i}) = \beta_{0} + \beta_{1} \cdot \text{Condition}_{i}\). Therefore, we have to apply the inverse link function (i.e., \(\exp(\cdot)\)) to the parameter to be able to interpret it. Taking \(\exp(\beta_{1})\) gives the proportional change in the value of the non-decision time parameter when we go from the speed to the accuracy condition. In our case, \(\exp(\beta_{1}) \approx 1.12\) which means that going from the speed to the accuracy condition leads to an increase of approximately 12% of the non-decision time. In other words, non-decisional processes seem to take longer in the accuracy than in the speed condition.
# retrieves the posterior samples for the non-decision time
posterior_intercept_ndt <- post$b_ndt_Intercept
posterior_slope_ndt <- post$b_ndt_condition1
# computes the posterior distribution in the speed condition
posterior_ndt_speed <- exp(posterior_intercept_ndt - 0.5 * posterior_slope_ndt)
# computes the posterior distribution in the accuracy condition
posterior_ndt_accuracy <- exp(posterior_intercept_ndt + 0.5 * posterior_slope_ndt)
The starting point is a measure of response bias towards one of the two response boundaries and is bounded between 0 and 1. The linear model for this parameter is on the logit (log-odds) scale: \(\log(\frac{\beta_{i}}{1 - \beta_{i}}) = \beta_{0} + \beta_{1} \cdot \text{Condition}_{i}\). Therefore, we have to apply the inverse link function (i.e., \(\mathrm{logit}^{-1}(\beta_{i}) = \mathrm{logistic}(\beta_{i}) = \frac{1}{1 + \exp(- \beta_{i})} = \frac{\exp(\beta_{i})}{\exp(\beta_{i}) + 1}\)) to the parameter to be able to interpret it on its natural scale (i.e., between 0 and 1). There seems to be a bias toward the “word” responses in the accuracy condition, but not (or less) in the speed condition.
# retrieves the posterior samples for the bias
posterior_intercept_bias <- post$b_bias_Intercept
posterior_slope_bias <- post$b_bias_condition1
# computes the posterior distribution in the speed condition
posterior_bias_speed <- plogis(posterior_intercept_bias - 0.5 * posterior_slope_bias)
# computes the posterior distribution in the accuracy condition
posterior_bias_accuracy <- plogis(posterior_intercept_bias + 0.5 * posterior_slope_bias)
Somewhat unsurprisingly, we find that response caution is much higher in the accuracy than in the speed condition, but the same goes for the drift rate and (to a lesser extent) the non-decision time.
How do we know that these parameters actually refer to the processes we think they refer to? We check that experimental manipulations that are supposed to affect only some component (rate of information uptake, setting of response criteria, duration of the motor response, bias) actually do (e.g., Ratcliff, 2002; Ratcliff & Rouder, 1998; Voss et al., 2004).
We can also check parameter values in different groups with known specificities (e.g., age-related slowing in Ratcliff et al., 2000, 2001), or we can try to validate the interpretation of these parameters by using additional measures such as electrophysiological (e.g., EMG, EEG) measures (e.g., Servant et al., 2021; Weindel et al., 2021).
Bayesian inference is a general approach to parameter estimation. This approach uses probability theory to quantify the uncertainty with respect to the value of parameters from statistical models.
These models are composed of different blocks (e.g., likelihood function, priors, linear or non-linear model), which can be modified as needed. What we usually refer to as “model assumptions” are simply the consequences of modelling choices. In other words, the user defines (rather than endures) the model’s assumptions.
We have seen that the linear regression model provides a very flexible architecture which makes it possible to describe, via the modification of the likelihood function and via the introduction of link functions, complex (e.g., non-linear) relationships between outcomes and predictors. These models can gain in precision by taking into account the variability and structures (clusters) present in the data (cf. multilevel models).
The brms package is a real Swiss army knife for Bayesian statistics in R. It allows you to fit almost any type of regression model. This includes all the models that we have seen, but also many others: multivariate models (i.e., models with several outcomes), “distributional” models (e.g., to predict differences in variance), generalised additive models, Gaussian processes, models from signal detection theory, mixture models, drift-diffusion models, non-linear models…
Do not hesitate to contact me for more information on these models or if you have questions about your own data. You can also contact the creator of the brms package, who is very active online (see his site). See also the Stan forum.
Ladislas Nalborczyk - IBSM2023