class: center, middle, inverse, title-slide # Moving to a World beyond p < .05 (and BF > 3) ### Ladislas Nalborczyk ### Aix Marseille Univ, CNRS, LPC, LNC, Marseille, France ### 30.08.2021
@lnalborczyk
Slides available at
tinyurl.com/ilcb2021
--- # Overview 1. Introduction to the philosophy of statistics: Theories, models, evidence, inference + Theoretical and statistical models + Statistical evidence and inference 2. Common misinterpretations of: + P-values and confidence intervals + Bayes factors 3. Problems induced by the mindless use of statistics + False-positive Psychology + The alpha wars 4. Practical recommendations and conclusions --- # Conclusions - Do not say "statistically significant" **The American Statistician** recently published a special issue on *Moving to a World Beyond "p<.05"*, with the intention of providing new recommendations for users of statistics (e.g., researchers, policy makers, journalists). This issue comprises 43 original papers aiming to provide new guidelines and practical alternatives to the "mindless" use of statistics. In the accompanying editorial, [Wasserstein et al. (2019)](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913) provide a first practical recommendation. -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term "statistically significant" entirely. Nor should variants such as "significantly different", "p < 0.05," and "nonsignificant" survive, whether expressed in words, by asterisks in a table, or in some other way. .tr[ — Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond "p<0.05". The American Statistician, 73, 1-19. ]] --- # Conclusions - ATOM guidelines Then, they summarise their practical recommendations in the form of the **ATOM** guidelines: - **Accept uncertainty**: we must "countenance uncertainty in all statistical conclusions, seeking ways to quantify, visualize, and interpret the potential for error" ([Calin-Jageman & Cumming, 2019](https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1518266)). -- - **Be thoughtful**: we clearly distinguish between confirmatory (preregistered) and exploratory (non-preregistered) statistical analyses. We routinely evaluate the *validity* of the statistical model and we are suspicious of statistical *defaults*. -- - **Be open**: we try to be exhaustive in the way we report our analyses and we beware of shortcuts that could hide important information from the reader. -- - **Be modest**: we recognise that there is no unique "true statistical model" and we discuss the limitations of our analyses and conclusions. We also recognise that scientific inference is much broader than statistical inference and we try not to conclude anything from a single study without acknowledging the warranted uncertainty. --- class: middle, center # Introduction to the philosophy of statistics (why do we need statistics in the first place?) ## Theories, models, evidence, inference --- # Scientific theories A scientific theory can be defined as **a set of logical propositions that posits causal relationships between observable phenomena**. These logical propositions are originally abstract and broad (e.g., "every object responds to the force of gravity in the same way") but lead to concrete and specific predictions that are empirically testable (e.g., "the falling speed of two objects should be the same, all other things being equal"). -- The concept of a scientific theory is not a unitary concept, though.
As an example, [Meehl (1986)](https://meehl.umn.edu/sites/meehl.umn.edu/files/files/128socscientistsdontunderstand.pdf) lists three kinds of theories: - **Functional-dynamic theories**, which relate “states to states or events to events”. For instance, we say that when one variable changes, certain other variables change in such and such ways. - **Structural-compositional theories**: the main idea is to explain what something is composed of, or what kind of parts it has and how they are put together. - **Evolutionary theories**, which are about the history and/or development of things (e.g., Darwin's theory, Wegener's theory of continental drift, the fall of Rome, etc.). --- # First problem: We can not confirm theories .pull-left[ According to [Campbell (1990)](https://www.tandfonline.com/doi/abs/10.1207/s15327965pli0102_2), the (intuitive) logical argument of science has the following form: - If Newton's theory A is true, then it should be observed that the tides have period B, the path of Mars shape C, the trajectory of a cannonball form D, etc. - Observation confirms B, C, and D. - Therefore Newton's theory A is "true". However, this is a fallacious argument known as the **affirmation of the consequent**. The invalidity comes from the existence of the cross-hatched area, that is, other possible explanations for B, C, and D being observed (figure from [Campbell, 1990](https://www.tandfonline.com/doi/abs/10.1207/s15327965pli0102_2)). ] .pull-right[ <img src="figures/campbell.jpeg" width="100%" style="display: block; margin: auto;" /> ] --- # Second problem: We can not (strictly) falsify theories We can not confirm theories, but maybe we can at least think of a way of disproving them? But what does it mean for a theory to be false? According to Popper's view, a theory is falsifiable if it could, in principle, be shown to be false. -- Note that the falsifiability of early Popper concerns the problem of demarcation (i.e., what is science and what is pseudoscience), and defines pseudosciences as composed of non-falsifiable theories (i.e., theories that do not allow the possibility of being disproved). -- But when it comes to describing how science works (descriptive purposes) or to prescribing how scientific enquiries should be conducted (prescriptive purposes), science is usually not described by the falsification standard, as Popper himself recognised and argued. In fact, deductive falsification is impossible in nearly every scientific context ([McElreath, 2016](https://xcelab.net/rm/statistical-rethinking/)). -- In the next sections, we discuss some of the reasons that prevent any theory from being strictly falsified (in a logical sense), namely: i) the distinction between theoretical and statistical models, ii) the problem of measurement, iii) the problem of continuous hypotheses, and iv) the Duhem-Quine problem. --- # Theoretical and statistical models A statistical model is a device that connects theories to data. It can be defined as an instantiation of a theory as a set of probabilistic statements ([Rouder, Morey, & Wagenmakers, 2016](https://online.ucpress.edu/collabra/article/2/1/6/112677/The-Interplay-between-Subjectivity-Statistical)). -- <img src="figures/mcelreath.png" width="33%" style="display: block; margin: auto;" /> Theoretical models and statistical models are usually not equivalent, as many different theoretical models can correspond to the same probabilistic description. Conversely, different probabilistic descriptions can be derived from the same theoretical model.
In other words, there is no one-to-one mapping between the two worlds, which renders the induction from the statistical model to the theoretical model quite tricky (figure from [McElreath, 2020](https://xcelab.net/rm/statistical-rethinking/)). --- # Theoretical and statistical inference Causal and inferential relations between substantive theory, statistical hypothesis, and observational data (figure from [Meehl, 1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf)). <img src="figures/meehl.png" width="75%" style="display: block; margin: auto;" /> --- # Measurement error The logic of falsification is pretty simple and rests on the power of the modus tollens. This argument (whose exposition, for some reason, usually involves swans) can be presented as follows: - If my theory `\(T\)` is right, then I should observe these data `\(D\)` - I observe data that are not those I predicted `\(\neg D\)` - Therefore, my theory is wrong `\(\neg T\)` -- This argument is perfectly valid and works well for logical statements (statements that are either true or false). However, the first problem that arises when we try to apply this reasoning to the "real world" is the problem of observation error: observations are prone to error, especially at the boundaries of knowledge ([McElreath, 2016](https://xcelab.net/rm/statistical-rethinking/)). --- # Measurement error .pull-left[ According to Einstein, neutrinos can not travel faster than the speed of light. Thus, any observation of faster-than-light neutrinos would act as a strong falsifier of Einstein's special relativity. In 2011 however, a large team of respected physicists announced the detection of faster-than-light neutrinos (see the Wikipedia article: https://en.wikipedia.org/wiki/Faster-than-light_neutrino_anomaly). What was the reaction of the scientific community? The dominant reaction was not to declare Einstein's theory falsified, but instead to ask: "How did this team mess up the measurement?" ([McElreath, 2016](https://xcelab.net/rm/statistical-rethinking/)). ] .pull-right[ <img src="figures/cable.png" width="80%" style="display: block; margin: auto;" /> ] --- # Probabilistic hypotheses Another problem arises from a misapplication of deductive syllogistic reasoning (more precisely, of the modus tollens). The problem (the "permanent illusion", as put by [Gigerenzer, 1993](https://media.pluto.psy.uconn.edu/Gigerenzer%20superego%20ego%20id.pdf)) is that most scientific hypotheses are not really of the kind "all swans are white" but rather of the form: - Ninety percent of swans are white. - If my hypothesis is correct, we should probably not observe a black swan. -- Given this hypothesis, what can we conclude if we observe a black swan? Not much. To understand why, let's first translate it into a more common statement in psychological research (from [Cohen, 1994](http://www.iro.umontreal.ca/~dift3913/cours/papers/cohen1994_The_earth_is_round.pdf)): - If the null hypothesis is true, then these data are highly unlikely. - These data have occurred. - Therefore, the null hypothesis is highly unlikely. But because of the probabilistic premise (i.e., the "highly unlikely"), this conclusion is invalid. Why?
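-- A quick way to see the problem is to plug numbers into Bayes' rule. The sketch below is a toy computation (all values are arbitrary assumptions, chosen only for illustration): even when the data are "highly unlikely" under the null hypothesis, the null hypothesis may remain quite probable if the data are also unlikely under the alternative.

```r
# toy application of Bayes' rule (all numbers are made-up assumptions)
p_h0 <- 0.5             # prior probability of H0
p_data_given_h0 <- 0.01 # the data are "highly unlikely" under H0...
p_data_given_h1 <- 0.02 # ...but they are barely less unlikely under H1

# posterior probability of H0, given the data
(p_data_given_h0 * p_h0) /
    (p_data_given_h0 * p_h0 + p_data_given_h1 * (1 - p_h0) )
```

```
## [1] 0.3333333
```

With these (made-up) numbers, observing the "unlikely" data only brings the probability of H0 down to 1/3: whether H0 ends up "highly unlikely" depends on the prior and on how likely the data are under the alternative, two ingredients that the above syllogism ignores.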
--- # Probabilistic hypotheses Consider the following argument (still from [Cohen, 1994](http://www.iro.umontreal.ca/~dift3913/cours/papers/cohen1994_The_earth_is_round.pdf), borrowed from [Pollard & Richardson, 1987](https://psycnet.apa.org/record/1987-30223-001)): - If a person is an American, he is probably not a member of Congress. - This person is a member of Congress. - Therefore, he is probably not an American. This conclusion is not sensible (the argument is invalid), because it fails to consider the alternative to the premise, which is that if this person were not an American, the probability of being a member of Congress would be 0. -- This is formally exactly the same as: - If the null hypothesis is true, then these data are highly unlikely. - These data have occurred. - Therefore, the null hypothesis is highly unlikely. This argument is just as invalid as the previous one, because i) the premise (the hypothesis) is probabilistic/continuous rather than discrete/logical and ii) it fails to consider the probability of the alternative. Thus, even without measurement/observation error, this problem would prevent us from applying the modus tollens to our hypothesis, ruling out any possibility of strict falsification. --- # The underdetermination problem Yet another problem is known as the [Duhem–Quine thesis/problem](https://en.wikipedia.org/wiki/Duhem–Quine_thesis) (aka the *underdetermination problem*). In practice, when a substantive theory `\(T\)` happens to be tested, some hidden assumptions, such as auxiliary theories about the instruments we use, are also put under examination ([Meehl, 1978](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf); [1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf); [1997](https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf)). -- When we test a theory predicting that "if `\(O_{1}\)`" (some manipulation), "then `\(O_{2}\)`" (some observation), what we actually mean is that we should observe this relation, **if and only if** all of the above (i.e., the auxiliary theories, the instrument theories, the particulars, etc.) are true. ??? These involve auxiliary theories that help to connect the substantive theory with the "real world", in order to make testable predictions (e.g., "both white and black swans walk around a similar proportion of the time, so that we are equally likely to observe them in nature"). It also usually involves some auxiliary theories about the instruments we use (e.g., "the BDI is a valid instrument for measuring depressive symptoms"), and the empirical realisation of specific conditions describing the experimental *particulars* ([Meehl, 1978](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf); [1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf); [1997](https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf)).
--- # The underdetermination problem Thus, the logical structure of an empirical test of a theory `\(T\)` can be described as the following conceptual formula ([Meehl, 1978](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf); [1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf); [1997](https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf)): `$$(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})$$` where the `\(\land\)` are conjunctions ("and"), the arrow `\(\to\)` denotes deduction ("it follows that ..."), and the horseshoe `\(\supset\)` is the material conditional ("if `\(O_{1}\)`, then `\(O_{2}\)`"). `\(A_{t}\)` is a conjunction of auxiliary theories, `\(C_{p}\)` is a *ceteris paribus* clause (i.e., we assume there is no other factor exerting an appreciable influence that could obfuscate the main effect of interest), `\(A_{i}\)` is an auxiliary theory regarding instruments, and `\(C_{n}\)` is a statement about experimentally realised conditions (i.e., we assume that there is no systematic error/noise in the experimental settings). -- In other words, we claim that the conjunction of all the elements on the left-hand side (including our substantive theory `\(T\)`) implies the right-hand side of the arrow, that is, "if `\(O_{1}\)`, then `\(O_{2}\)`". The falsificationist attitude of the modern psychologist would lead her to think that not observing this relation would falsify the substantive theory of interest, based on the valid fourth figure of the implicative syllogism (the modus tollens). --- # The underdetermination problem `$$(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})$$` However, although the modus tollens is a valid figure of the implicative syllogism for logical statements (e.g., "all swans are black"), the neatness of Popper's classic falsifiability concept is fuzzed up by the acknowledgement of the actual form of an empirical test. Obtaining falsifying evidence during an empirical test does not falsify the substantive theory `\(T\)` alone; it falsifies the entire left-hand side of the above statement. In other words, what we have achieved by our laboratory or correlational "falsification" is a falsification of the combined claims `\(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}\)`, which is probably not what we had in mind when we did the experiment ([Meehl, 1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf)). -- To sum up, failing to observe a predicted outcome does not necessarily mean that the theory itself is wrong, but rather that the conjunction of the theory and the underlying assumptions at hand is invalid ([Lakatos, 1978](https://www.cambridge.org/core/books/methodology-of-scientific-research-programmes/8DBCEFE34A59BAD3D393FB958A4DC5FC); [Meehl, 1978](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf); [1990](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.6444&rep=rep1&type=pdf)). --- # Consequences Falsification in science is almost always consensual, not logical ([McElreath, 2020](https://xcelab.net/rm/statistical-rethinking/)). A theoretical claim is considered to be falsified only when multiple lines of converging evidence have been obtained, by independent teams of researchers, and usually after several years or decades of critical discussion.
The "falsification of a theory" then appears as a social result, issued from the community of scientists, and (almost) never as a deductive falsification. -- How can we accumulate **evidence** in favour of or against a theory? --- class: middle, center # Common misinterpretations of p-values and confidence intervals --- # Null Hypothesis Significance Testing (NHST) Let's say we are interested in height differences between women and men... ```r men <- rnorm(n = 100, mean = 174, sd = 10) # 100 men heights women <- rnorm(n = 100, mean = 170, sd = 10) # 100 women heights ``` -- ```r t.test(x = men, y = women) ``` ``` ## ## Welch Two Sample t-test ## ## data: men and women ## t = 2.7805, df = 196.13, p-value = 0.005956 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 1.238872 7.283495 ## sample estimates: ## mean of x mean of y ## 173.3333 169.0721 ``` <!-- --- --> <!-- # Interpreting the p-value --> <!-- From [Greenland et al. (2016)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/pdf/10654_2016_Article_149.pdf) and [Goodman (2008)](https://www.ohri.ca/newsroom/seminars/SeminarUploads/1829%5CSuggested%20Reading%20-%20Nov%203,%202014.pdf). --> <!-- - The p-value is the probability that the null hypothesis is true. For example, if a test of the null hypothesis gave `\(p = 0.01\)`, the null hypothesis has only a 1% chance of being true. --> <!-- -- --> <!-- - The p-value for the null hypothesis is the probability that chance alone produced the observed association. For example, if the p-value for the null hypothesis is 0.08, there is an 8% probability that chance alone produced the association. --> <!-- -- --> <!-- - The p-value is the chance of our data occurring if the null hypothesis is true. For example, `\(p = 0.05\)` means that the observed association would occur only 5% of the time under the null hypothesis. --> --- <iframe src="https://embed.polleverywhere.com/multiple_choice_polls/ok3TmpBIp0fmbxGJDS7rr?controls=none&short_poll=true" width="1200px" height="600px"></iframe> --- class: middle, center ## None of these definitions is true... ## 🥺🥺🥺🥺 --- # Null Hypothesis Significance Testing (NHST) We are going to simulate t-values computed on samples generated under the assumption of no difference between women and men (the null hypothesis H0). ```r nsims <- 1e4 # number of simulations t <- rep(x = NA, times = nsims) # initialising an empty vector for (i in 1:nsims) { men2 <- rnorm(n = 100, mean = 170, sd = 10) women2 <- rnorm(n = 100, mean = 170, sd = 10) t[i] <- t.test(x = men2, y = women2)$statistic } ``` -- Or without for loops. 
```r t <- replicate(n = nsims, expr = t.test(x = rnorm(100, 170, 10), y = rnorm(100, 170, 10) )$statistic) ``` --- # Null Hypothesis Significance Testing (NHST) ```r data.frame(t = t) %>% ggplot(aes(x = t) ) + geom_histogram() + theme_bw(base_size = 20) ``` <img src="ilcb_summer_school_files/figure-html/unnamed-chunk-9-1.svg" width="40%" style="display: block; margin: auto;" /> --- # Null Hypothesis Significance Testing (NHST) ```r data.frame(t = c(-5, 5) ) %>% ggplot(aes(x = t) ) + stat_function(fun = dt, args = list(df = t.test(men, women)$parameter), size = 1.5) + theme_bw(base_size = 20) + ylab("Probability density") ``` <img src="ilcb_summer_school_files/figure-html/unnamed-chunk-10-1.svg" width="40%" style="display: block; margin: auto;" /> --- # Null Hypothesis Significance Testing (NHST) ```r alpha <- .05 abs(qt(alpha / 2, df = t.test(x = men, y = women)$parameter) ) # two-sided critical t-value ``` ``` ## [1] 1.972133 ``` <img src="ilcb_summer_school_files/figure-html/unnamed-chunk-12-1.svg" width="40%" style="display: block; margin: auto;" /> --- # Null Hypothesis Significance Testing (NHST) ```r tobs <- t.test(x = men, y = women)$statistic # observed t-value tobs %>% as.numeric ``` ``` ## [1] 2.780528 ``` <img src="ilcb_summer_school_files/figure-html/unnamed-chunk-14-1.svg" width="40%" style="display: block; margin: auto;" /> --- # P-values A p-value is simply a tail area (an integral) computed from the distribution of test statistics under (given) the null hypothesis. It gives the probability of observing the data we observed *or more extreme data*, **given that the null hypothesis is true** ([Wagenmakers et al., 2007](https://link.springer.com/article/10.3758/BF03194105)). `$$p[\mathbf{t}(\mathbf{x}^{\text{rep}}|H_{0}) \geq t(x)]$$` -- ```r t.test(x = men, y = women)$p.value ``` ``` ## [1] 0.005955894 ``` -- ```r tvalue <- abs(t.test(x = men, y = women)$statistic) df <- t.test(x = men, y = women)$parameter 2 * integrate(f = dt, lower = tvalue, upper = Inf, df = df)$value ``` ``` ## [1] 0.005955896 ``` --- # Fisher versus Neyman & Pearson <img src="figures/fisher.jpeg" width="33%" style="display: block; margin: auto;" /> .pull-left[ A Fisherian p-value is thought to measure the strength of evidence against the null hypothesis: the lower the p-value, the stronger the evidence against the null hypothesis. But we know that p-values at best *correlate* (in a loose sense) with evidence (e.g., see [Wagenmakers, 2007](http://www.ejwagenmakers.com/2007/pValueProblems.pdf)). The Fisherian continuous interpretation of p-values has many problems (cf. next slide) and has been widely criticised. ] -- .pull-right[ Neyman & Pearson used p-values and significance thresholds as a way of **controlling error rates in the long run**. In this perspective, we don't interpret the p-value, we only "classify" results as *significant* or *non-significant*. This strict procedure allows keeping error rates at a fixed level (given that the null hypothesis is true, see this [blogpost](https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-in-p-values)). However, this view also has serious problems, one of the biggest being the *domain problem* (see [Trafimow & Earp, 2017](https://www.sciencedirect.com/science/article/pii/S0732118X16301076?via%3Dihub)). ] --- # Logic, frequentism, and probabilistic reasoning The modus tollens is one of the strongest rules of inference in logic.
It works perfectly well in science when we deal with hypotheses of the following form: *If `\(H_{0}\)` is true, then we should not observe `\(x\)`. We observed `\(x\)`. Then, `\(H_{0}\)` is false*. -- BUT, most of the time, we deal with *continuous*, *probabilistic* hypotheses... The Fisherian inference (induction) is of the form: *If `\(H_{0}\)` is true, then we should PROBABLY not observe `\(x\)`. We observed `\(x\)`. Then, `\(H_{0}\)` is PROBABLY false*. -- However, this argument is invalid. The modus tollens does not apply to probabilistic statements (e.g., [Pollard & Richardson, 1987](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.505.9968&rep=rep1&type=pdf); [Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016](http://www.ejwagenmakers.com/2016/RouderEtAl2016FreeLunch.pdf)). --- # Interpreting confidence intervals Confidence intervals are basically regions of significance. Thus, they have to be interpreted as cautiously as p-values, and are subject to the same flaws. A 95% confidence interval **does not mean** that there is a 95% probability that the interval contains the population value of the parameter (remember the *modus tollens* fallacy). -- The only correct interpretation is to think about it in terms of *coverage proportion* (see next slide and [this blogpost](http://rpsychologist.com/d3/CI/)). **A 95% confidence interval represents a statement about the procedure**, not about the parameter. It means that, in the long run, 95% of the confidence intervals we could compute (in exact replications of the experiment) would contain the population value of the parameter. But we can not say anything about the particular confidence interval we computed in this particular experiment... --- <div align="center"><iframe width="1200" height="600" src="http://rpsychologist.com/d3/CI/" scrolling="yes"></iframe></div> --- # Preliminary summary Frequentist statistics (e.g., p-values and confidence intervals) make sense under the frequentist interpretation of probability: they refer to **long-run frequencies**. -- P-values are simply tail areas in probability distributions. This means that they are conditional on some distribution. But it also means that computing a p-value is a generic statistical procedure: it is not inextricably tied to the null hypothesis (e.g., see [Bayesian p-values](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.310.145&rep=rep1&type=pdf)). -- Confidence intervals are basically regions of significance. Thus, they are prone to the very same limits as p-values. --- class: middle, center # Common misinterpretations of Bayes factors --- # Bayes factors Instead of testing only one hypothesis (the null hypothesis), Bayes factors allow the comparison of two hypotheses. For instance, let's say we are comparing two models: - `\(H_{0}: \mu_{1} = \mu_{2} \rightarrow \delta = 0\)` - `\(H_{1}: \mu_{1} \neq \mu_{2} \rightarrow \delta \neq 0\)` -- `$$\underbrace{\dfrac{p(H_{0}|D)}{p(H_{1}|D)}}_{posterior\ odds} = \underbrace{\dfrac{p(D|H_{0})}{p(D|H_{1})}}_{Bayes\ factor} \times \underbrace{\dfrac{p(H_{0})}{p(H_{1})}}_{prior\ odds}$$` -- `$$\text{evidence}\ = p(D|H) = \int p(\theta|H) p(D|\theta, H) \text{d}\theta$$` The *evidence* in favour of a model corresponds to the *marginal likelihood* of this model. In other words, it is the *likelihood* of the model averaged over (weighted by) its prior distribution, which makes the Bayes factor a kind of Bayesian likelihood ratio. --- # What does a Bayes factor look like? Let's say we want to estimate the bias `\(\theta\)` of a coin.
For convenience, we can write our predictions as two [beta-binomial models](http://www.barelysignificant.com/post/ppc/): $$ `\begin{align} \mathcal{M_{1}} : y_{i} &\sim \mathrm{Binomial}(n, \theta) \\ \theta &\sim \mathrm{Beta}(6, 10) \\ \end{align}` $$ $$ `\begin{align} \mathcal{M_{2}} : y_{i} &\sim \mathrm{Binomial}(n, \theta) \\ \theta &\sim \mathrm{Beta}(20, 12) \\ \end{align}` $$ <img src="ilcb_summer_school_files/figure-html/unnamed-chunk-18-1.svg" width="60%" style="display: block; margin: auto;" /> --- # What does a Bayes factor look like? <img src="figures/bf.gif" width="50%" style="display: block; margin: auto;" /> --- # Bayes factors are the new p-values... Be careful not to interpret Bayes factors as *posterior odds*... Bayes factors indicate how much we should update our *prior odds*, in the light of new incoming data. They **do not tell us which hypothesis is the most probable**, given the data (unless the prior odds are 1:1). -- Let's take another example: - `\(H_{0}\)`: there is no such thing as precognition - `\(H_{1}\)`: precognition really does exist We run an experiment and observe a `\(BF_{10} = 27\)`. What is the posterior probability of H1? -- `$$\underbrace{\dfrac{p(H_{1}|D)}{p(H_{0}|D)}}_{posterior\ odds} = \underbrace{\dfrac{27}{1}}_{Bayes\ factor} \times \underbrace{\dfrac{1}{1000}}_{prior\ odds} = \dfrac{27}{1000} = 0.027$$` With prior odds of 1:1000 against precognition, the posterior odds in favour of `\(H_{1}\)` are only 0.027, that is, a posterior probability of `\(H_{1}\)` of about 0.026, despite the seemingly strong Bayes factor. --- class: middle, center # Problems induced by the mindless use of statistics ## False-positive Psychology --- # False-positive Psychology .pull-left[ In Psychology, the origin of the reproducibility crisis is often linked to [Daryl Bem's (2011)](https://content.apa.org/doiLanding?doi=10.1037%2Fa0021524) paper, which reported empirical evidence for the existence of "psi", otherwise known as Extra Sensory Perception. This paper passed through the standard peer review process and was published in the high-impact Journal of Personality and Social Psychology (cf. https://plato.stanford.edu/entries/scientific-reproducibility/). ] .pull-right[ <img src="figures/bem.png" width="100%" style="display: block; margin: auto;" /> ] --- # False-positive Psychology .pull-left[ <img src="figures/simmons1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figures/simmons2.png" width="100%" style="display: block; margin: auto;" /> ] --- # Questionable research practices (QRPs) <img src="figures/qrp.png" width="60%" style="display: block; margin: auto;" /> Undisclosed flexibility in data collection, analysis, and interpretation dramatically increases the false-positive rate (cf. the "garden of forking paths" from [Gelman & Loken, 2013](http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf)). This makes it very easy to publish significant results consistent with virtually any hypothesis. --- # Consequences of questionable research practices <img src="figures/replicability.png" width="80%" style="display: block; margin: auto;" /> --- class: middle, center # Problems induced by the mindless use of statistics ## The alpha wars --- # Lowering the bar? .pull-left[ [Benjamin et al. (2017)](https://www.nature.com/articles/s41562-017-0189-z) proposed to "change the default p-value threshold for statistical significance from .05 to .005 for claims of new discoveries". The key argument is that, when analysed from a "Bayesian" perspective, a p-value of .05 tends to provide very little evidence (given certain decisions about priors).
They further claim that reducing the criterion for "evidence" to p < .005 will "immediately improve the reproducibility of scientific research in many fields". ] .pull-right[ <img src="figures/redefine.png" width="100%" style="display: block; margin: auto;" /> ] --- # Lowering the bar? .pull-left[ A two-sided p-value of .05 corresponds to Bayes factors in favour of H1 that range from about 2.5 to 3.4 (for all scenarios). But a two-sided p-value of .005 corresponds to Bayes factors in favour of H1 that range from about 13.9 to 25.7. ] .pull-right[ <img src="figures/redefine1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Lowering the bar? .pull-left[ "[...] the false positive rate is greater than 33% with prior odds of 1:10 and a p-value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5%. Similar reductions in false positive rates would occur over a wide range of statistical powers." ] .pull-right[ <img src="figures/redefine2.png" width="100%" style="display: block; margin: auto;" /> ] --- # Some responses Many commentaries have been published since (NB: this list is far from exhaustive): - Justify Your Alpha (Lakens et 90+ al.) - Abandon Statistical Significance (McShane, Gal, Gelman, Robert, & Tackett) - Manipulating the Alpha Level Cannot Cure Significance Testing (Trafimow et 50+ al.) - Remove, rather than redefine, statistical significance (Amrhein & Greenland) - Why "Redefining Statistical Significance" Will Not Improve Reproducibility and Could Make the Replication Crisis Worse (Crane) - When the statistical tail wags the scientific dog (Richard Morey), three posts published on Medium --- # Some responses - There is "no sufficient evidence that the current standard is leading the reproducibility crisis" (Lakens et al.) -- - Simply changing the threshold keeps encouraging naive statistical thinking and over-reliance on a single statistical index (e.g., Lakens et al., Amrhein & Greenland, McShane et al., Trafimow et al.) -- - Changing the threshold from .05 to .005 might aggravate the biases of (our use of) NHST, e.g., dichotomous thinking, QRPs, exaggeration of effect sizes, publication biases, low power... (e.g., Amrhein & Greenland, Lakens et al., Trafimow et al., Crane) -- - The difference between new discoveries and usual findings is blurry... how should we then judge replication success? Is the order of studies important? (McShane et al., Trafimow et al.) -- - "The new significance threshold will help researchers and readers to understand and communicate evidence more accurately." But if researchers have understanding problems with a .05 threshold, it is unclear how using a .005 threshold will eliminate these problems. (Trafimow et al.) --- # Considered recommendations - Abandoning the term "significant" (but not NHST), as well as statistical standards (Lakens et al.) -- - Entirely abandoning (the focus on) NHST (Amrhein & Greenland, McShane et al., Trafimow et al.)
-- - Using alternatives to NHST, e.g., likelihood inference (see Royall, 1997), fully Bayesian analyses (see Gelman et al., 2013), a priori inferential statistics (see Trafimow, 2017; Trafimow & MacDonald, 2017) --- # Consensual recommendations - Abandoning the term "significant", avoiding dichotomous thinking, embracing uncertainty -- - "Reliable scientific conclusions require information to be combined from multiple studies and lines of evidence" (Amrhein & Greenland, 2017) -- - All design or modeling choices should be transparently reported and justified (e.g., the alpha level, the threshold for BFs, the choice of priors in Bayesian analyses, the statistical modelling strategy) --- class: middle, center # Practical recommendations and conclusions --- # Do not say "statistically significant" **The American Statistician** recently published a special issue on *Moving to a World Beyond "p<.05"*, with the intention of providing new recommendations for users of statistics (e.g., researchers, policy makers, journalists). This issue comprises 43 original papers aiming to provide new guidelines and practical alternatives to the "mindless" use of statistics. In the accompanying editorial, [Wasserstein et al. (2019)](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913) provide a first practical recommendation. -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term "statistically significant" entirely. Nor should variants such as "significantly different", "p < 0.05," and "nonsignificant" survive, whether expressed in words, by asterisks in a table, or in some other way. .tr[ — Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond "p<0.05". The American Statistician, 73, 1-19. ]] --- # ATOM guidelines Then, they summarise their practical recommendations in the form of the **ATOM** guidelines: - **Accept uncertainty**: we must "countenance uncertainty in all statistical conclusions, seeking ways to quantify, visualize, and interpret the potential for error" ([Calin-Jageman & Cumming, 2019](https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1518266)). -- - **Be thoughtful**: we clearly distinguish between confirmatory (preregistered) and exploratory (non-preregistered) statistical analyses. We routinely evaluate the *validity* of the statistical model and we are suspicious of statistical *defaults*. -- - **Be open**: we try to be exhaustive in the way we report our analyses and we beware of shortcuts that could hide important information from the reader. -- - **Be modest**: we recognise that there is no unique "true statistical model" and we discuss the limitations of our analyses and conclusions. We also recognise that scientific inference is much broader than statistical inference and we try not to conclude anything from a single study without acknowledging the warranted uncertainty. --- # Further resources The special issue: https://www.tandfonline.com/toc/utas20/73/sup1 Introduction to the Meehlian Corroboration-Verisimilitude theory of science: https://www.barelysignificant.com/post/corroboration1/ and https://www.barelysignificant.com/post/corroboration2/ A full course on Bayesian statistical (thinking and) modelling: McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second Edition. CRC Press.
Everything is fucked: The syllabus, https://hardsci.wordpress.com/2016/08/11/everything-is-fucked-the-syllabus/ Some examples of ATOMised reporting of statistical modelling (from my own work): https://pubs.asha.org/doi/abs/10.1044/2018_JSLHR-S-18-0006, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0233282, https://journals.sagepub.com/doi/abs/10.1177/0956797619900336 The materials of my doctoral course on Bayesian statistical modelling (in French): https://github.com/lnalborczyk/IMSB2020 --- # Take-home messages ### Don'ts - Do not say "statistically significant". - Do not dichotomise or trichotomise statistical results. -- ### Dos - Accept uncertainty. Be thoughtful, open, and modest. - Transition from mindless statistics to statistical thinking. - Read, digest, and teach some philosophy of statistics. <br> [lnalborczyk](https://twitter.com/lnalborczyk) [lnalborczyk](https://github.com/lnalborczyk) [https://osf.io/ba8xt](https://osf.io/ba8xt) [www.barelysignificant.com](https://www.barelysignificant.com)