Informed Bayesian Inference for the A/B Test
Quentin F. Gronau, K. N. Akash Raj, and Eric-Jan Wagenmakers

TL;DR
This paper introduces a Bayesian A/B testing method that allows for continuous evidence monitoring, incorporation of expert prior knowledge, and assessment of null effects, addressing limitations of existing approaches.
Contribution
It presents a Bayesian A/B testing procedure based on Kass and Vaidyanathan (1992) that supports evidence monitoring, null hypothesis evaluation, and prior knowledge integration.
Findings
Supports evidence monitoring during data collection
Allows explicit evaluation of null hypothesis
Incorporates expert prior knowledge
Table 1: Prior probabilities of the hypotheses for five qualitatively different tests.
| Hypothesis | Default | Undirected | Positive | Negative | Direction |
|---|---|---|---|---|---|
| H0 | .50 | .50 | .50 | .50 | 0 |
| H1 | 0 | .50 | 0 | 0 | 0 |
| H+ | .25 | 0 | .50 | 0 | .50 |
| H- | .25 | 0 | 0 | .50 | .50 |
Quentin F. Gronau (University of Amsterdam), Akash Raj K. N. (University of Amsterdam), Eric-Jan Wagenmakers (University of Amsterdam)
Abstract
Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition. Across many practical applications, it is desirable that (1) evidence can be obtained in favor of the null hypothesis that the treatment is ineffective; (2) evidence can be monitored as the data accumulate; (3) expert prior knowledge can be taken into account. Most existing approaches do not fulfill these desiderata. Here we describe a Bayesian A/B procedure based on Kass and Vaidyanathan (1992) that allows one to monitor the evidence for the hypotheses that the treatment has either a positive effect, a negative effect, or, crucially, no effect. Furthermore, this approach enables one to incorporate expert knowledge about the relative prior plausibility of the rival hypotheses and about the expected size of the effect, given that it is non-zero. To facilitate the wider adoption of this Bayesian procedure we developed the abtest package in R. We illustrate the package options and the associated statistical results with a fictitious business example and a real data medical example.
Keywords: model comparison, Bayes factor, prior elicitation, Bayesian estimation
Address:
Quentin F. Gronau
Department of Psychological Methods
University of Amsterdam
Nieuwe Achtergracht 129 B
1018 WT Amsterdam, The Netherlands
E-mail:
1 Introduction
Does the modification of a company website increase the number of online purchases? Does a new drug result in a lower mortality rate? These are just two examples of the kinds of questions that can be addressed with A/B testing, a procedure popular not only in business and medical clinical trials, but also in fields such as psychology, neuroscience, and biology. The A/B test set-up discussed in this article assumes that the outcome variable is binary; nevertheless, the outcome variable could in principle also be continuous. Based on a binary outcome variable, an A/B test compares the success rate of two options or treatment arms, A and B, and therefore can be conceptualized as a test for a difference between two proportions (Little, 1989). Typically, options A and B correspond to a control condition and an intervention or treatment of interest.
Regardless of the specific field of application, we believe three general desiderata for A/B tests can be identified. First, we believe it is desirable that evidence can be obtained in favor of the null hypothesis that there is no difference between options A and B. For instance, suppose a programmer alters code that should leave the appearance of a website unaffected. An A/B test may be conducted to confirm that the code changes did not lead to unintended consequences. Alternatively, suppose that a cheaper drug is introduced as a replacement of the standard drug; here, an A/B test may confirm that the cheaper drug is as effective as the drug that is currently standard.
Second, we believe it is desirable that evidence can be monitored as the data accumulate. Data collection can be time-consuming and expensive, and interim tests allow one to assess whether the results in hand are already sufficiently compelling or whether additional data ought to be obtained. There is also an ethical aspect to this desideratum, one that is particularly pronounced in case of new clinical treatments that are potentially beneficial or harmful; it is unethical to withhold treatment that interim analysis shows to be beneficial, just as it is unethical to continue to administer a treatment that interim analysis shows to be harmful (e.g., Armitage, 1960; see also Ware, 1989 and the accompanying discussion).
Third, we believe it is desirable that expert knowledge can be taken into account (e.g., O’Hagan, 2019). In many A/B testing applications, there exists considerable expert knowledge about what size of effect to expect. For instance, the effect of website changes on conversion rates is often less than 0.5% (Berman et al., 2018). Incorporating such expert knowledge into the statistical analysis will yield a more targeted test.
The majority of A/B testing procedures that are currently in vogue do not fulfill the above desiderata. Specifically, many companies apply standard $p$-value-based null hypothesis significance testing to assess whether or not options A and B differ. This procedure has the advantage that it is readily available in software such as R (R Core Team, 2019; e.g., via the functions prop.test, fisher.test, and chisq.test). However, this approach cannot distinguish between absence of evidence (i.e., the data are inconclusive) and evidence of absence (i.e., the data provide support for the null hypothesis that options A and B do not differ; e.g., Dienes, 2014; Keysers et al., 2020). Furthermore, although common practice, sequentially monitoring the uncorrected $p$-value (and stopping data collection as soon as the $p$-value is smaller than some fixed $\alpha$-level) invalidates the analysis (e.g., Feller, 1940). However, there exist valid classical sequential procedures that enable one to monitor a corrected $p$-value as data accumulate (e.g., Malek et al., 2017). For instance, Optimizely, one of the leading commercial A/B testing platforms, has recently implemented an alternative $p$-value-based approach that allows users to continuously monitor the test outcome (Johari et al., 2017). Nevertheless, these sequential $p$-value-based procedures retain the inability to quantify evidence for the absence of an effect. Furthermore, (sequential) $p$-value-based A/B testing does not allow one to incorporate expert knowledge into the statistical analysis in a straightforward manner.
An alternative A/B testing approach that has become more popular of late is Bayesian estimation. For instance, VWO, another leading A/B testing platform, has recently implemented a Bayesian estimation approach (Stucchio, 2015). A Bayesian estimation approach is also available via the BayesianFirstAid package (Bååth, 2014) and the bayesAB package (Portman, 2019). [Footnote 1: The bayesAB package provides a range of functions for Bayesian A/B testing. One advantage is that users can choose from a range of different data distributions (e.g., Bernoulli, normal, Poisson, etc.).] Since Bayesian inference does not require sample sizes to be fixed a priori (Berger and Wolpert, 1988), this approach allows one to monitor the analysis output as data accumulate. A Bayesian estimation approach also enables the incorporation of expert knowledge via the specification of a prior distribution that captures the expert’s knowledge about a parameter of interest. However, this approach operates under the assumption that an effect exists (since a continuous prior assigns zero probability to a single null value) and consequently does not allow one to obtain evidence in favor of the null hypothesis of no effect. For instance, bayesAB and BayesianFirstAid provide the user with the posterior probability that one option yields more successes than the other, but this ignores the fact that both options could be equally effective. Furthermore, the currently used Bayesian estimation approaches (such as the one implemented in bayesAB and BayesianFirstAid) typically assign independent priors to the success probabilities of the control and treatment condition, a practice that was critiqued by Howard (1998). [Footnote 2: “do English or Scots cattle have a higher proportion of cows infected with a certain virus? Suppose we were informed (before collecting any data) that the proportion of English cows infected was 0.8. With independent uniform priors we would now give $H_1$ ($p_1 > p_2$) a probability of 0.8 (because the chance that $p_2 > 0.8$ is still 0.2). In very many cases this would not be appropriate. Often we will believe (for example) that if $p_1$ is 80%, $p_2$ will be near 80% as well and will be almost equally likely to be larger or smaller.” (p. 363)]
To overcome the limitations of the current A/B tests we developed the abtest package in R (R Core Team, 2019). The abtest package implements one form of Bayesian inference for the A/B test, using informed prior distributions that induce a dependency between the two success probabilities. The analysis approach is based on a model by Kass and Vaidyanathan (1992); for alternative approaches see Deng et al. (2016), Jamil et al. (2017), Pham-Gia et al. (2017), and Skorski (2019). The implemented Bayesian procedure allows users (1) to obtain evidence in favor of the null hypothesis (e.g., Berger and Delampady, 1987; Wagenmakers et al., 2018); (2) to monitor the evidence as the data accumulate (e.g., Rouder, 2014); and (3) to elicit and incorporate expert prior knowledge (e.g., O’Hagan, 2019). The abtest package thus fulfills all three desiderata mentioned above.
The abtest package provides functionality for both hypothesis testing and parameter estimation. In line with Jeffreys (1939) and Fisher (1928), we believe that testing and estimation are complementary activities (Haaf et al., 2019): before a parameter is estimated, it should be tested whether there is anything to justify estimation at all. Jeffreys (1939, p. 345) related this principle to Occam’s razor: “variation must be taken as random until there is positive evidence to the contrary” (see also Kass and Raftery, 1995, Section 8.1). However, some researchers and practitioners oppose this idea, for instance because they believe that one should replace hypothesis testing with parameter estimation (e.g., Gelman and Rubin, 1995; Cumming, 2014). Nevertheless, the abtest package may also be useful for researchers without an interest in hypothesis testing, since the package can also be used exclusively for Bayesian parameter estimation (and prior elicitation).
This article is organized as follows: The next section introduces a fictitious business example. Afterwards, the implementation details of the Bayesian A/B test procedure used in abtest are discussed. Subsequently, the fictitious example is continued and the functionality of the abtest package and the practical benefits of the implemented approach are demonstrated. Next, a real data medical example is used to demonstrate further functionality of the package. The article ends with concluding comments.
2 Example 1: effectiveness of resilience training
Suppose the managers of a large consultancy firm are interested in reducing the number of employees who quit within the first six months, possibly due to the high stress involved in the job. A coaching company offers a resilience training and claims that this training greatly reduces the number of employees who quit. Implementing the training for all newly hired employees would be expensive and some of the managers are not completely convinced that the training is at all effective. Therefore, the managers decide to run an A/B test where half of a sample of newly hired employees will receive the training, the other half will not be trained. The outcome variable is whether or not an employee quit within the first six months (1 = still on the job, 0 = quit).
The consultancy firm collects observations ( in each group). These (fictitious) data are included in the abtest package (i.e., seqdata). [Footnote 3: The data set is structured such that the sequential nature of the data is retained: the data set contains the number of observations and the number of successes in each of the two groups after each observation.] The number of employees still on the job after six months is in the group without training and in the trained group. Figure 1 provides an illustration of some of the information that can be obtained by analyzing these data using abtest. The figure displays the probability of the hypothesis that the training has a positive effect (i.e., $H_+$), a negative effect (i.e., $H_-$), and no effect (i.e., $H_0$) as a function of the number of observations across the two groups. The top part of the figure displays the probability of the three hypotheses before and after taking into account the observed data (i.e., prior and posterior probabilities) as probability wheels (e.g., Tversky, 1969; Lipkus and Hollands, 1999). Before providing more details about how to obtain and interpret this result as well as providing additional analyses, we discuss the implementation details of the A/B test procedure used by abtest.
3 Implementation details
The Bayesian A/B test implemented in the abtest package is based on Kass and Vaidyanathan (1992, Section 3, “Testing Equality of Two Binomial Proportions”). Appendices A-C provide detailed derivations.
3.1 Model
Let $y_1$ denote the number of successes for option A with $n_1$ denoting the corresponding total number of observations for option A. Similarly, $y_2$ denotes the number of successes for option B with $n_2$ denoting the corresponding total number of observations for option B. The Bayesian A/B test model based on Kass and Vaidyanathan (1992) is specified as follows: [Footnote 4: Note that this is equivalent to a logistic regression model with a binary covariate (i.e., group membership) that is coded using $\pm 1/2$.]
$$\begin{aligned} \log\left(\frac{p_1}{1-p_1}\right) &= \beta - \frac{\psi}{2}, & y_1 &\sim \text{Binomial}(n_1, p_1), \\ \log\left(\frac{p_2}{1-p_2}\right) &= \beta + \frac{\psi}{2}, & y_2 &\sim \text{Binomial}(n_2, p_2). \end{aligned}$$
Therefore, the model assumes that $y_1$ and $y_2$ follow binomial distributions with success probabilities $p_1$ and $p_2$. These probabilities are functions of the two model parameters, $\beta$ and $\psi$. Specifically, the log odds corresponding to $p_1$ are given by $\beta - \psi/2$ and the log odds corresponding to $p_2$ are given by $\beta + \psi/2$. The nuisance parameter $\beta$ corresponds to the grand mean of the log odds and the test-relevant parameter $\psi$ corresponds to the log odds ratio. When $\psi$ is positive, this implies that $p_2 > p_1$ (i.e., option B has a higher success probability than option A); when $\psi$ is negative this implies that $p_2 < p_1$ (i.e., option B has a lower success probability than option A).
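As a minimal illustration of this parameterization, the following base R sketch maps the grand mean of the log odds ($\beta$) and the log odds ratio ($\psi$) to the two success probabilities. The helper name `logodds_to_probs` and the input values are hypothetical and not part of the abtest package.

```r
# Hypothetical helper (not part of abtest): map the grand mean of the log
# odds (beta) and the log odds ratio (psi) to the two success probabilities.
logodds_to_probs <- function(beta, psi) {
  c(p1 = plogis(beta - psi / 2),  # inverse logit of beta - psi/2
    p2 = plogis(beta + psi / 2))  # inverse logit of beta + psi/2
}

# Illustrative values: a positive psi implies p2 > p1.
probs <- logodds_to_probs(beta = 0, psi = 0.5)

# The log odds ratio recovered from p1 and p2 equals psi.
log_or <- qlogis(probs[["p2"]]) - qlogis(probs[["p1"]])
```

Setting `psi = 0` in this sketch yields identical success probabilities, which is the situation described by the null hypothesis.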
3.2 Hypotheses
The abtest package enables both estimation of the model parameters and testing of hypotheses about the test-relevant log odds ratio parameter $\psi$. There are four hypotheses that are of potential interest:
1. The null hypothesis $H_0$ which states that the success probabilities $p_1$ and $p_2$ are identical, that is, $p_1 = p_2$. This is equivalent to $\psi = 0$. This hypothesis corresponds to the claim that there is no difference between options A and B (i.e., the “A/A test”).
2. The two-sided alternative hypothesis $H_1$ which states that the two success probabilities $p_1$ and $p_2$ are not equal (i.e., $p_1 \neq p_2$), but does not specify which of the two is larger. This is equivalent to $\psi \neq 0$. This hypothesis corresponds to the claim that options A and B differ but it is not specified which one yields more successes.
3. The one-sided hypothesis $H_+$ which states that the second success probability $p_2$ is larger than the first success probability $p_1$. This is equivalent to $\psi > 0$. This hypothesis corresponds to the claim that option B yields more successes than option A.
4. The one-sided hypothesis $H_-$ which states that the first success probability $p_1$ is larger than the second success probability $p_2$. This is equivalent to $\psi < 0$. This hypothesis corresponds to the claim that option A yields more successes than option B.
Researchers who conduct an A/B test are usually interested in answering the question: Does option B yield more successes than option A (i.e., $H_+$), fewer successes than option A (i.e., $H_-$), or is there no difference between options A and B (i.e., $H_0$)? Therefore, it may be argued that the hypotheses of interest are typically $H_+$, $H_-$, and $H_0$. Consequently, by default, only these three hypotheses are assigned non-zero prior probability in the abtest package. Specifically, a default prior probability of .50 is assigned to the hypothesis that there is no effect (i.e., $H_0$), and the remaining prior probability is split evenly across the hypothesis that there is a positive effect (i.e., $H_+$ receives .25) and a negative effect (i.e., $H_-$ also receives .25). The user may change these default prior probabilities to custom values.
Table 1 provides an overview of five qualitatively different tests that can be conducted by assigning prior probabilities to hypotheses in certain ways. [Footnote 5: Note that, except for the first column of Table 1 which displays the default setting, the remaining examples use equal prior probabilities for all hypotheses that are assigned non-zero prior probability. However, the user can of course also assign prior probability unevenly to the hypotheses of interest (e.g., if prior knowledge exists about the relative plausibility of the rival hypotheses).] The first column displays the default setting that assigns probability .50 to the null hypothesis and splits the remaining probability evenly across $H_+$ and $H_-$. The second column displays a prior probability assignment that implements an undirected test (i.e., $H_0$ is compared to the undirected $H_1$). The third column displays a prior probability assignment for testing whether the effect is non-existent or positive. The fourth column displays a prior probability assignment for testing whether the effect is non-existent or negative. Finally, the fifth column displays a prior probability assignment for a test of direction, that is, for testing whether the effect is positive or negative. This last setting may be of interest whenever the null hypothesis is a priori deemed implausible, uninteresting, or irrelevant.
3.3 Parameter priors
The abtest package assigns normal priors to the model parameters: $\beta \sim N(\mu_\beta, \sigma_\beta^2)$ and $\psi \sim N(\mu_\psi, \sigma_\psi^2)$. As illustrated in the example below, these priors result in a dependency in the implied prior for the success probabilities $p_1$ and $p_2$, which is generally desirable (Howard, 1998).
For the one-sided hypotheses $H_+$ and $H_-$, the prior on $\psi$ is truncated at zero. Specifically, for $H_+$, the prior on $\psi$ is a truncated normal distribution with parameters $\mu_\psi$ and $\sigma_\psi$ and lower bound at zero. For $H_-$, the prior on $\psi$ is a truncated normal distribution with parameters $\mu_\psi$ and $\sigma_\psi$ and upper bound at zero. These normal priors are computationally convenient and sufficiently flexible to encode a wide range of prior information.
By default, the abtest package assigns standard normal priors to both $\beta$ and $\psi$. For the nuisance parameter $\beta$, a standard normal prior results in a relatively flat implied prior on $p_1$ and $p_2$ when $\psi = 0$. Generally, the choice of a prior for the nuisance parameter is relatively inconsequential (Kass and Vaidyanathan, 1992). In contrast, the prior on the test-relevant parameter $\psi$ is consequential, as it defines the extent to which the hypotheses of interest differ from $H_0$. Our choice for a default standard normal prior on the test-relevant parameter $\psi$ is motivated by the fact that a zero-centered prior does not favor either of the two options A or B a priori. Furthermore, the standard deviation of 1 results in a prior distribution that assigns mass to a wide range of reasonable log odds ratios (Chen et al., 2010) without being so uninformative that the results unduly favor $H_0$ (Bartlett, 1957; Lindley, 1957). [Footnote 6: Note that the default implied prior on the absolute risk is considerably more narrow than the prior induced by the popular default choice that assigns $p_1$ and $p_2$ independent uniform distributions (Jeffreys, 1935).] However, large changes in the prior standard deviation of the test-relevant parameter may result in large changes in the results, as the prior standard deviation governs the degree to which the hypothesis of interest makes predictions that differ from $H_0$. To include prior knowledge about the expected results, the abtest package allows the user to change the default values of the prior distributions for the nuisance parameter $\beta$ and the test-relevant parameter $\psi$, either by changing the location of the normal prior distribution, the scale, or both.
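The dependency that the normal priors induce between the two success probabilities can be checked by simulation. The base R sketch below assumes standard normal priors on the nuisance parameter and on the log odds ratio under the two-sided hypothesis; the variable names, seed, and number of draws are illustrative choices and the code does not use the abtest package.

```r
# Simulate the implied joint prior on (p1, p2), assuming standard normal
# priors on beta and psi (two-sided hypothesis).
set.seed(123)
n_draws <- 1e5
beta <- rnorm(n_draws, mean = 0, sd = 1)
psi  <- rnorm(n_draws, mean = 0, sd = 1)
p1 <- plogis(beta - psi / 2)
p2 <- plogis(beta + psi / 2)

# The shared beta induces a positive prior correlation between p1 and p2.
prior_cor <- cor(p1, p2)

# Implied prior on the absolute risk p2 - p1 (central 95% interval).
arisk_interval <- quantile(p2 - p1, probs = c(0.025, 0.975))
```

With independent uniform priors on the two success probabilities the correlation would instead be zero, which is the practice critiqued by Howard (1998).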
3.4 Encoding prior information
A straightforward way to encode prior information about the model parameters is to set $\mu_\psi$, $\sigma_\psi$, $\mu_\beta$, and $\sigma_\beta$ directly. However, it may sometimes be easier to specify prior distributions based on quantities such as the (log) odds ratio, relative risk (i.e., $p_2 / p_1$, the ratio of the success probability in condition B and condition A), and absolute risk (i.e., $p_2 - p_1$, the difference of the success probability in condition B and condition A). The elicit_prior function allows users to encode prior information about a quantity of interest (either log odds ratio, odds ratio, relative risk, or absolute risk). The function assumes that the prior on $\beta$ is not the primary target of prior elicitation and is fixed by the user a priori (using the arguments mu_beta and sigma_beta), for instance, to a standard normal prior which corresponds to a relatively flat implied prior on $p_1$ and $p_2$ when $\psi = 0$.
To encode prior information, the user needs to provide quantiles for a quantity of interest. Let $q_k$, $k = 1, \ldots, K$, denote the values of the $K$ quantiles provided by the user and let $\text{prob}_k$ denote the corresponding probabilities (e.g., for the median, $\text{prob}_k = 0.5$). Least-squares minimization is used to obtain $\mu_\psi$ and $\sigma_\psi$ as follows:
$$(\mu_\psi, \sigma_\psi) = \arg\min_{\mu,\, \sigma} \sum_{k=1}^{K} \left[ F(q_k; \mu, \sigma) - \text{prob}_k \right]^2,$$
where $F(\cdot\,; \mu, \sigma)$ corresponds to the cumulative distribution function (cdf) for the quantity of interest implied by a normal prior on $\psi$ with mean $\mu$ and standard deviation $\sigma$. For some quantities, this cdf also depends on the prior for $\beta$; however, as described above, it is assumed that $\mu_\beta$ and $\sigma_\beta$ are fixed a priori.
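The least-squares idea can be sketched in base R for the simplest case, the log odds ratio, where the implied cdf is just the normal cdf of the prior on the test-relevant parameter. The function name `elicit_logor_sketch`, the optimizer settings, and the quantiles are assumptions made for illustration; this is not the code used by elicit_prior, and other quantities (e.g., absolute risk) would require a different cdf.

```r
# Hypothetical quantile matching for the log odds ratio, for which the
# implied cdf is simply the normal cdf of the prior on psi.
elicit_logor_sketch <- function(q, prob) {
  loss <- function(par) {
    mu <- par[1]
    sigma <- exp(par[2])  # optimize log(sigma) so that sigma stays positive
    sum((pnorm(q, mean = mu, sd = sigma) - prob)^2)
  }
  # Nelder-Mead search started at the mean of the quantiles and sigma = 1.
  fit <- optim(par = c(mean(q), 0), fn = loss,
               control = list(reltol = 1e-12, maxit = 2000))
  c(mu_psi = fit$par[1], sigma_psi = exp(fit$par[2]))
}

# Illustrative (made-up) quantiles: median 0.6, 95% interval from 0.1 to 1.1.
est <- elicit_logor_sketch(q = c(0.1, 0.6, 1.1),
                           prob = c(0.025, 0.5, 0.975))
```

Because these three quantiles are exactly consistent with a normal distribution, the fitted location should be close to 0.6 and the fitted scale close to 0.5 divided by the 97.5% standard normal quantile.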
3.5 Hypothesis testing
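To quantify the evidence that the data provide for $H_0$, $H_1$, $H_+$, and $H_-$, one can compute Bayes factors (Jeffreys, 1939; Kass and Raftery, 1995) and posterior probabilities of the rival hypotheses. The posterior probability of hypothesis $H_i$, $i \in \{0, 1, +, -\}$, is given by:
$$p(H_i \mid \text{data}) = \frac{p(\text{data} \mid H_i)\, p(H_i)}{\sum_{k \in \{0, 1, +, -\}} p(\text{data} \mid H_k)\, p(H_k)}.$$
The Bayes factor for comparing hypotheses $H_i$ and $H_j$ equals the change from prior to posterior odds:
$$\frac{p(H_i \mid \text{data})}{p(H_j \mid \text{data})} = \frac{p(H_i)}{p(H_j)} \times \text{BF}_{ij}, \qquad \text{BF}_{ij} = \frac{p(\text{data} \mid H_i)}{p(\text{data} \mid H_j)}.$$
In order to obtain posterior probabilities of the hypotheses and Bayes factors one needs to evaluate the marginal likelihood $p(\text{data} \mid H_i)$ for each hypothesis. For $H_0$ and $H_1$, we evaluate the marginal likelihood using Laplace approximations as suggested by Kass and Vaidyanathan (1992). Specifically, the marginal likelihood for $H_0$ is approximated by:
$$p(\text{data} \mid H_0) \approx \sqrt{2\pi}\, \sigma_0 \exp\left\{ l_0^{\ast}(\beta_0^{\ast}) \right\},$$
where $l_0^{\ast}(\beta) = \log p(\text{data} \mid \beta, \psi = 0) + \log \pi(\beta)$ is the log of the likelihood times the prior density $\pi(\beta)$, $\beta_0^{\ast}$ corresponds to the mode of $l_0^{\ast}$, and $\sigma_0^2 = \left(-\frac{d^2}{d\beta^2}\, l_0^{\ast}(\beta)\right)^{-1}\bigg\rvert_{\beta = \beta_0^{\ast}}$ denotes the inverse of the negative second derivative of $l_0^{\ast}$ evaluated at the mode $\beta_0^{\ast}$.
The marginal likelihood for $H_1$ is approximated by:
$$p(\text{data} \mid H_1) \approx 2\pi\, \left|\bm{\Sigma}_1\right|^{1/2} \exp\left\{ l^{\ast}(\beta^{\ast}, \psi^{\ast}) \right\},$$
where $l^{\ast}(\beta, \psi) = \log p(\text{data} \mid \beta, \psi) + \log \pi(\beta) + \log \pi(\psi)$, $(\beta^{\ast}, \psi^{\ast})$ denotes the mode of $l^{\ast}$, and $\bm{\Sigma}_1 = \left(-\bm{H}_1\right)^{-1}\big\rvert_{(\beta,\psi) = (\beta^{\ast},\psi^{\ast})}$ denotes the inverse of the negative Hessian (i.e., the matrix with second-order partial derivatives) of $l^{\ast}$ evaluated at the mode $(\beta^{\ast}, \psi^{\ast})$.
These Laplace approximations work well in practice, even for sample sizes that are extremely small. As a demonstration, for a range of synthetic data sets we computed the (log of the) Bayes factor that compares $H_0$ and $H_1$ using the above Laplace approximations and, as a comparison, also using bridge sampling (Meng and Wong, 1996; Gronau et al., 2020). The priors on $\beta$ and $\psi$ were standard normal distributions. Figure 2 displays the results and confirms that the Laplace approximation yields accurate results, even for sample sizes as small as .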
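To make the univariate Laplace approximation concrete, the base R sketch below applies it to the null model (log odds ratio fixed at zero, normal prior on the nuisance parameter) and compares the result with one-dimensional numerical integration. The data values and variable names are made up for illustration, and the comparison uses `integrate` rather than the bridge sampling check reported in the article.

```r
# Made-up data for illustration only.
y1 <- 10; n1 <- 20; y2 <- 12; n2 <- 20
mu_beta <- 0; sigma_beta <- 1

# Log of likelihood times prior under H0 (psi = 0, so p1 = p2).
l0 <- function(beta) {
  p <- plogis(beta)
  dbinom(y1, n1, p, log = TRUE) + dbinom(y2, n2, p, log = TRUE) +
    dnorm(beta, mu_beta, sigma_beta, log = TRUE)
}

# Mode of l0 and the inverse of its negative second derivative at the mode.
# For the logit link the second derivative is available in closed form.
mode0 <- optimize(l0, interval = c(-10, 10), maximum = TRUE)$maximum
p_mode <- plogis(mode0)
sigma0_sq <- 1 / ((n1 + n2) * p_mode * (1 - p_mode) + 1 / sigma_beta^2)

# Laplace approximation to the log marginal likelihood.
logml_laplace <- 0.5 * log(2 * pi * sigma0_sq) + l0(mode0)

# Reference value from numerical integration around the mode.
sd0 <- sqrt(sigma0_sq)
logml_numeric <- log(integrate(function(b) exp(l0(b)),
                               lower = mode0 - 10 * sd0,
                               upper = mode0 + 10 * sd0)$value)
```

The bivariate case for the two-sided hypothesis follows the same pattern, with a numerical Hessian (for example from `optim(..., hessian = TRUE)`) taking the place of the closed-form second derivative.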
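For the one-sided hypotheses $H_+$ and $H_-$, Laplace approximations did not appear to yield accurate results for small sample sizes, even after removing the constraint on $\psi$ through the parameterization $\xi = \log(\psi)$ for $H_+$ and $\xi = \log(-\psi)$ for $H_-$. The abtest package therefore uses importance sampling to increase the accuracy of the Laplace approximations when computing the marginal likelihoods for $H_+$ and $H_-$. Specifically, a Laplace approximation is used to approximate the mode and covariance matrix of the posterior. The importance density is then given by a multivariate $t$ distribution with location set to the approximated posterior mode, scale matrix set to the approximated posterior covariance matrix, and five degrees of freedom (note that the user can change the degrees of freedom). The marginal likelihood for $H_+$ is then estimated as follows:
$$p(\text{data} \mid H_+) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\exp\left\{ l_+(\tilde{\beta}_i, \tilde{\xi}_i) \right\}}{g_+(\tilde{\beta}_i, \tilde{\xi}_i)},$$
where $\{(\tilde{\beta}_i, \tilde{\xi}_i)\}_{i=1}^{N}$ denotes $N$ samples from the multivariate $t$ importance density $g_+$, and
$$l_+(\beta, \xi) = \log p(\text{data} \mid \beta, \psi = e^{\xi}) + \log N(\beta; \mu_\beta, \sigma_\beta^2) + \log N_+(e^{\xi}; \mu_\psi, \sigma_\psi^2) + \xi,$$
where $N(x; \mu, \sigma^2)$ denotes the probability density function of a normal distribution with mean $\mu$ and variance $\sigma^2$ that is evaluated at $x$. Furthermore, $N_+$ denotes the density of a normal distribution that is truncated to allow only positive values for $\psi$, and the final term $\xi$ is the log Jacobian of the transformation $\psi = e^{\xi}$. The marginal likelihood for $H_-$ is computed analogously.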
3.6 Obtaining posterior samples
In a Bayesian A/B test application, one may not only be interested in testing hypotheses, but also in obtaining posterior samples for the model parameters under $H_1$, $H_+$, and $H_-$. The abtest package allows the user to obtain posterior samples using sampling importance resampling (e.g., Robert and Casella, 2010). Specifically, posterior samples for $H_+$ are obtained as follows (samples for the other hypotheses are obtained in an analogous manner):
1. Generate $N$ samples from the multivariate $t$ proposal distribution mentioned before, denoted by $\tilde{\theta}_i = (\tilde{\beta}_i, \tilde{\xi}_i)$, $i = 1, \ldots, N$.
2. Compute the importance weights $w_i = \exp\{ l_+(\tilde{\theta}_i) \} / g_+(\tilde{\theta}_i)$, where $l_+$ is the log of the unnormalized posterior density under $H_+$ and $g_+$ is the proposal density.
3. Renormalize the importance weights: $\tilde{w}_i = w_i / \sum_{j=1}^{N} w_j$, $i = 1, \ldots, N$.
4. Resample (with replacement) from the samples obtained from the importance density according to the normalized importance weights which yields (approximate) samples from the posterior distribution.
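The importance sampling and resampling steps can be sketched in base R as follows. This is an illustrative sketch under stated assumptions, not the package's implementation: the data are made up, the priors are standard normal, the positive-effect hypothesis is handled through the log transform of the log odds ratio, and all variable names are hypothetical.

```r
set.seed(1)
# Made-up data and standard normal priors, for illustration only.
y1 <- 10; n1 <- 20; y2 <- 12; n2 <- 20
mu_beta <- 0; sigma_beta <- 1; mu_psi <- 0; sigma_psi <- 1

# Log unnormalized posterior under H+ in terms of (beta, xi), psi = exp(xi).
l_plus <- function(par) {
  beta <- par[1]; xi <- par[2]; psi <- exp(xi)
  dbinom(y1, n1, plogis(beta - psi / 2), log = TRUE) +
    dbinom(y2, n2, plogis(beta + psi / 2), log = TRUE) +
    dnorm(beta, mu_beta, sigma_beta, log = TRUE) +
    dnorm(psi, mu_psi, sigma_psi, log = TRUE) -
    pnorm(0, mu_psi, sigma_psi, lower.tail = FALSE, log.p = TRUE) +
    xi  # log Jacobian of psi = exp(xi)
}

# Laplace step: posterior mode and covariance on the (beta, xi) scale.
fit <- optim(c(0, 0), function(par) -l_plus(par), method = "BFGS",
             hessian = TRUE)
post_mode <- fit$par
post_cov <- solve(fit$hessian)

# Step 1: draw from a multivariate t proposal with 5 degrees of freedom.
n_samples <- 10000; df <- 5; d <- 2
L <- t(chol(post_cov))
z <- matrix(rnorm(d * n_samples), nrow = d)
u <- rchisq(n_samples, df) / df
draws <- post_mode + t(t(L %*% z) / sqrt(u))  # d x n_samples matrix

# Log density of the multivariate t proposal at the draws.
delta_sq <- colSums(z^2) / u
log_g <- lgamma((df + d) / 2) - lgamma(df / 2) - (d / 2) * log(df * pi) -
  sum(log(diag(L))) - ((df + d) / 2) * log1p(delta_sq / df)

# Step 2: importance weights (on the log scale for numerical stability).
log_w <- apply(draws, 2, l_plus) - log_g
# The average weight estimates the marginal likelihood under H+.
ml_plus <- mean(exp(log_w))

# Steps 3 and 4: normalize the weights and resample with replacement.
w_norm <- exp(log_w - max(log_w))
w_norm <- w_norm / sum(w_norm)
idx <- sample.int(n_samples, n_samples, replace = TRUE, prob = w_norm)
psi_post <- exp(draws[2, idx])  # approximate posterior draws of psi
```

The heavy tails of the $t$ proposal keep the importance weights bounded when the Laplace approximation underestimates the tails of the posterior, which is the usual motivation for choosing a small number of degrees of freedom.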
4 Example 1: effectiveness of resilience training (continued)
Next we continue the effectiveness of resilience training example and show how expert prior information can be taken into account, how the hypotheses of interest can be tested, and how one can estimate the model parameters using the abtest package.
4.1 Prior specification
Before commencing the A/B test, the managers asked the coaching company to specify how effective they believe the training will be. The coaching company claimed that, based on past experience with the training, they expect the proportion of employees who do not quit within the first six months to be 15% larger for the group who received the training, with a 95% uncertainty interval ranging from a 2.5% benefit to a 27.5% benefit. Assuming that the claimed 15% corresponds to the prior median, this expectation corresponds to a median absolute risk (i.e., $p_2 - p_1$) of 0.15 with a 95% uncertainty interval ranging from 0.025 to 0.275. The elicit_prior function can be used to encode this prior information: [Footnote 7: All code and plots are also available at https://osf.io/t3ajr/.]
```r
library("abtest")
# Encode the 2.5%, 50%, and 97.5% quantiles of the absolute risk.
prior_par <- elicit_prior(q = c(0.025, 0.15, 0.275),
                          prob = c(.025, .5, .975),
                          what = "arisk")
```
The obtained prior on the absolute risk can be visualized as follows:
```r
plot_prior(prior_par, what = "arisk")
```
The resulting graph is shown in the top panel of Figure 3.
The user can also visualize the (implied) prior for other quantities. For instance, the prior on the log odds ratio (middle panel of Figure 3) is obtained as follows:
```r
plot_prior(prior_par, what = "logor")
```
The implied prior on the success probabilities $p_1$ and $p_2$ (bottom panel of Figure 3) is obtained as follows:
```r
plot_prior(prior_par, what = "p1p2")
```
The bottom panel of Figure 3 illustrates that there is a dependency between $p_1$ and $p_2$ which is arguably desirable (Howard, 1998): When one of the success probabilities is very small (large), it is likely that the other one will also be small (large).
4.2 Hypothesis testing
Since the number of employees still on the job after six months is in the group without training and in the trained group, the observed success probabilities are in the control group and in the group that received training. Consequently, the observed success probabilities suggest that there is a positive effect of the training of 4%; however, a statistical analysis is required to assess whether this observed difference is statistically compelling. The ab_test function can be used to conduct a Bayesian A/B test as follows:
```r
data("seqdata")
set.seed(1)
ab <- ab_test(data = seqdata, prior_par = prior_par)
print(ab)
```
This yields the following output:
```
Bayesian A/B Test Results:

 Bayes Factors:

 BF10: 0.1406443
 BF+0: 0.13823
 BF-0: 0.4920187

 Prior Probabilities Hypotheses:

 H+: 0.25
 H-: 0.25
 H0: 0.5

 Posterior Probabilities Hypotheses:

 H+: 0.0526
 H-: 0.1871
 H0: 0.7604
```
The first part of the output presents Bayes factors in favor of the hypotheses $H_1$, $H_+$, and $H_-$, where the reference hypothesis (i.e., denominator of the Bayes factor) is $H_0$. Since all three Bayes factors are smaller than 1, they all indicate evidence in favor of the null hypothesis of no effect. The next part of the output displays the prior probabilities of the hypotheses with non-zero prior probability. As explained before, the default setting assigns probability .50 to the null hypothesis and splits the remaining probability evenly across $H_+$ and $H_-$. The user can change this default setting via the prior_prob argument (e.g., to assign non-zero probability to $H_1$). The final part of the output displays the posterior probabilities of the hypotheses with non-zero prior probability. The posterior probability of the null hypothesis indicates that the data have increased the plausibility of the null hypothesis from 0.5 to 0.7604. Furthermore, the data have decreased the plausibility of both $H_+$ and $H_-$.
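The relation between the reported Bayes factors, prior probabilities, and posterior probabilities can be verified by hand. The base R sketch below uses only the numbers from the printed output above; the variable names are illustrative.

```r
# Bayes factors relative to H0, copied from the printed output above.
bf <- c("H+" = 0.13823, "H-" = 0.4920187, "H0" = 1)
prior_prob <- c("H+" = 0.25, "H-" = 0.25, "H0" = 0.5)

# Posterior probability is proportional to prior probability times
# marginal likelihood; dividing every term by p(data | H0) leaves the
# Bayes factors against H0.
post_prob <- prior_prob * bf / sum(prior_prob * bf)
round(post_prob, 4)

# Evidence for H0 over H+ is the reciprocal of BF+0.
bf_0plus <- 1 / bf[["H+"]]
```

Rounded to four decimals, the computed posterior probabilities agree with the values 0.0526, 0.1871, and 0.7604 shown in the output.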
As an aside, it may appear paradoxical that the data indicate a 4% positive effect of the training and yet the posterior probability of is larger than that of . The reason for this result is that the company’s prior was overly ambitious, and is penalized for having predicted effects that are much too large. Furthermore, note that the test-relevant prior distribution under is obtained by truncating the prior on at zero and renormalizing. Since the company’s prior assigns almost all mass to positive log odds ratio values, renormalizing the negative part of the distribution results in a prior that is highly similar to ; this explains why receives non-trivial posterior probability. These considerations underscore the fact that the outcome of a Bayesian analysis is always relative to the specific set of models (and associated prior distributions) under consideration. Because highly informed priors can exert a large influence on the results, it is generally wise to examine the robustness of the conclusions by executing the default analysis as well. This analysis is reported in Appendix D.
The abtest package allows users to visualize the posterior probabilities of the hypotheses by means of a probability wheel (Figure 4):
R> prob_wheel(ab)
Overall, the data support the hypothesis that the training is ineffective over the company’s hypothesis that the training is highly effective. The Bayes factor for H0 over H+ indicates moderate evidence (Jeffreys, 1939, Appendix I).
Since the data set is of a sequential nature, it may be of interest to consider not only the result based on all observations, but to conduct also a sequential analysis that tracks the evidential flow as a function of the total number of observations (i.e., the number of observations across both groups). This sequential analysis can be conducted as follows:
R> plot_sequential(ab, thin = 4)
Setting the thin argument to 4 indicates that the evidence is computed after every 4 observations. Thinning can be useful to speed up the analysis in case the data set is very large or in case observations arrive in batches. Figure 1 displays the result of the sequential analysis. The posterior probability of each hypothesis with non-zero prior probability is plotted as a function of the total number of observations. At the top, two probability wheels visualize the prior probabilities of the hypotheses and the posterior probabilities of the hypotheses based on all available data. Figure 1 shows that after some initial fluctuation, adding more observations increased the probability of the null hypothesis that there is no effect of the training.
4.3 Parameter estimation
The data indicate evidence in favor of the null hypothesis versus the hypothesis that the training is highly effective, leaving open the possibility that the training does have an effect, but of a more modest size than the company anticipated. To assess this possibility one may investigate the potential size of the effect under the assumption that the effect is non-zero. (For consistency, we continue this analysis with the company’s prior; an analysis with the less enthusiastic default prior is provided in Appendix D.) For parameter estimation, we generally prefer to investigate the posterior distribution for the unconstrained alternative hypothesis H1; however, the abtest package also provides posterior samples and plotting functionality for the constrained hypotheses H+ and H-.
The top panel of Figure 5 displays the posterior distribution for the absolute risk (i.e., p2 − p1) that can be obtained as follows:
R> plot_posterior(ab, what = "arisk")
The top panel of Figure 5 shows the prior distribution as a dotted line and the posterior distribution (with 95% central credible interval) as a solid line. The plot indicates that, under the assumption that the difference between the two success probabilities is not exactly zero, it is likely to be smaller than expected: the posterior median is and the 95% central credible interval ranges from to .
The middle panel of Figure 5 displays the posterior distribution for the log odds ratio that can be obtained as follows:
R> plot_posterior(ab, what = "logor")
The middle panel of Figure 5 indicates that, given the log odds ratio is not exactly zero, it is likely to be between and , where the posterior median is .
It may also be of interest to consider the marginal posterior distributions of the success probabilities p1 and p2. This plot can be produced as follows:
R> plot_posterior(ab, what = "p1p2")
The bottom panel of Figure 5 displays the resulting plot. In this example, p1 and p2 correspond to the probability of still being on the job after six months for the non-trained employees and the employees that received the training, respectively. The bottom panel of Figure 5 reports the posterior median and the 95% credible interval for each of p1 and p2.
In sum, this fictitious data set offers modest evidence in favor of the null hypothesis, which states that the training is not effective, over the hypothesis that the training is highly effective; nevertheless, the consultancy firm should probably continue to collect data in order to obtain more compelling evidence before deciding whether or not the training should be implemented. If the true effect is as small as 4%, continued testing will ultimately show compelling evidence for H+ over H0. Note that continued testing is trivial in the Bayesian framework: the results can simply be updated as new observations arrive.
5 Example 2: progesterone in women with bleeding in early pregnancy
As a second example application of the abtest package, here we present a reanalysis of a recent medical trial. (This reanalysis is also available on PsyArXiv: Gronau, Q. F., & Wagenmakers, E.-J. (2019). Progesterone in women with bleeding in early pregnancy: Absence of evidence, not evidence of absence. https://psyarxiv.com/etk7g/) Coomarasamy et al. (2019) assessed the effectiveness of progesterone in preventing miscarriages. The percentage of live births was 74.7% (1513/2025) in the progesterone group and 72.5% (1459/2013) in the placebo group. The authors concluded: “The incidence of adverse events did not differ significantly between the groups” (Coomarasamy et al., 2019, p. 1815).
This conclusion leaves unaddressed the degree to which the data undercut or support the no-effect hypothesis H0 over the positive-effect hypothesis H+. To quantify such evidence we can use the abtest package. A default analysis can be conducted as follows:
R> data <- list(y1 = 1459, n1 = 2013, y2 = 1513, n2 = 2025)
R> set.seed(1)
R> ab <- ab_test(data = data)
This yields the following output:
R> print(ab)

Bayesian A/B Test Results:

Bayes Factors:
BF10: 0.259709
BF+0: 0.4866008
BF-0: 0.02796485

Prior Probabilities Hypotheses:
H+: 0.25
H-: 0.25
H0: 0.5

Posterior Probabilities Hypotheses:
H+: 0.1935
H-: 0.0111
H0: 0.7954
A Bayes factor of BF0+ = 1/BF+0 ≈ 2.06 indicates that there is only weak evidence in favor of the no-effect hypothesis H0 over the positive-effect hypothesis H+ (Jeffreys, 1939). To alleviate concerns about the choice of the prior distribution for the test-relevant log odds ratio parameter one can conduct a prior robustness analysis as follows:
R> plot_robustness(ab, bftype = "BF0+")
Note that the bftype argument is used to indicate which Bayes factor is plotted (in this case BF0+). Figure 6 displays the results and shows that the evidence is weak for all combinations of the prior mean and prior standard deviation of the log odds ratio.
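The Bayes factor for the null over the positive-effect hypothesis is simply the reciprocal of the printed BF+0. The following base R sketch illustrates the arithmetic; the helper evidence_label is hypothetical (it is not part of abtest), and the 1-3-10-30-100 thresholds are a commonly used convention assumed here for illustration.

```r
# BF+0 copied from the printed output of print(ab)
bf_plus0 <- 0.4866008

# Evidence for H0 over H+ is the reciprocal of the evidence for H+ over H0
bf_0plus <- 1 / bf_plus0

# Hypothetical helper: verbal label for a Bayes factor >= 1.
# The cut points are an assumed convention, not taken from the package.
evidence_label <- function(bf) {
  as.character(cut(bf,
                   breaks = c(1, 3, 10, 30, 100, Inf),
                   labels = c("anecdotal", "moderate", "strong",
                              "very strong", "extreme"),
                   right = FALSE))
}

round(bf_0plus, 2)
evidence_label(bf_0plus)
```

Under these assumed thresholds a value just above 2 falls in the lowest category, in line with the "weak" evidence described in the text.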
In sum, these data neither undercut nor support the progesterone hypothesis in compelling fashion.
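For orientation, the observed effect sizes can be computed directly from the trial counts. This is a descriptive base R sketch only (sample proportions, not posterior summaries); the variable names are illustrative.

```r
# Counts reported for the trial (group 1 = placebo, group 2 = progesterone)
y1 <- 1459; n1 <- 2013
y2 <- 1513; n2 <- 2025

p1_hat <- y1 / n1                             # observed live-birth proportion, placebo
p2_hat <- y2 / n2                             # observed live-birth proportion, progesterone
arisk_hat <- p2_hat - p1_hat                  # observed absolute risk difference
rrisk_hat <- p2_hat / p1_hat                  # observed relative risk
logor_hat <- qlogis(p2_hat) - qlogis(p1_hat)  # observed log odds ratio

round(c(p1 = p1_hat, p2 = p2_hat, arisk = arisk_hat,
        rrisk = rrisk_hat, logor = logor_hat), 3)
```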
6 Concluding comments
In this article, we have introduced the abtest package that implements both Bayesian hypothesis testing and Bayesian estimation for the A/B test using informed priors. The procedure allows users to (1) obtain evidence in favor of the null hypothesis; (2) monitor the evidence as data accumulate; and (3) elicit and incorporate expert prior distributions. We hope that the provided analysis approach is useful across different fields that apply A/B testing on a routine basis, particularly business and medicine.
We have introduced the approach implemented in abtest as testing hypotheses of interest about the test-relevant log odds ratio parameter for the model in Equation 1. However, it should be pointed out that an alternative interpretation is to view the procedure as estimating a mixture model, where the mixture components correspond to the different hypotheses of interest, and the mixture weights are given by the prior/posterior probabilities of the hypotheses (e.g., Mitchell and Beauchamp, 1988). This interpretation is illustrated with a fictitious example in Figure 7. For simplicity, the plot assumes that the user has set the prior probabilities of H+ and H- to zero, whereas the prior probabilities of H0 and H1 are both set to .50. The left panel illustrates the mixture representation before having observed any data. Specifically, the height of the spike at zero corresponds to the prior probability of H0, whereas the shape of the slab corresponds to the continuous default prior distribution for the log odds ratio under H1. The maximum height of this continuous distribution corresponds to the prior probability of H1. (This scaling method is inspired by the BAS package; Clyde, 2020.) The right panel illustrates the mixture representation after having observed 20 successes out of 40 observations in the control condition and 30 successes out of 40 observations in the experimental condition (these are fictitious data). The height of the spike corresponds to the posterior probability of H0, and the maximum height of the continuous posterior distribution under H1 (i.e., the slab) corresponds to the posterior probability of H1. In this fictitious example, the data have decreased the plausibility of H0 and have increased the plausibility of H1.
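The mixture (spike-and-slab) interpretation can be written compactly. The display below is a sketch for the two-hypothesis case described above; the symbols $\psi$ for the log odds ratio and $\delta_0$ for a point mass at zero are notational assumptions of this sketch.

```latex
\begin{align*}
p(\psi \mid \text{data})
  &= P(\mathcal{H}_0 \mid \text{data})\,\delta_0(\psi)
   + P(\mathcal{H}_1 \mid \text{data})\,p(\psi \mid \text{data}, \mathcal{H}_1), \\
P(\mathcal{H}_1 \mid \text{data})
  &= \frac{P(\mathcal{H}_1)\,\text{BF}_{10}}
          {P(\mathcal{H}_0) + P(\mathcal{H}_1)\,\text{BF}_{10}},
\qquad
P(\mathcal{H}_0 \mid \text{data}) = 1 - P(\mathcal{H}_1 \mid \text{data}).
\end{align*}
```

With equal prior probabilities of .50, the posterior weight of the slab reduces to $\text{BF}_{10} / (1 + \text{BF}_{10})$.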
Despite the practical benefits that the package offers right now, there are areas for future improvement. For instance, abtest currently allows users to compare two groups; however, there are applications in which one may be interested in simultaneously comparing more than two groups. Furthermore, at the moment, abtest expects the outcome variable to be binary. Nevertheless, in certain scenarios, it may be more natural to compare the two groups based on a continuous outcome variable. This scenario resembles an independent samples t-test for which well-established Bayesian procedures exist (e.g., Rouder et al., 2009; Ly et al., 2016) which are available, for instance, in the BayesFactor package (Morey and Rouder, 2018) and JASP (JASP Team, 2020). (For a list of Bayesian R packages, see https://cran.r-project.org/web/views/Bayesian.html.) Moreover, currently, the abtest package does not provide functions for generating predictions. Note, however, that users can generate predictions in a straightforward manner themselves based on the posterior samples that are provided by abtest. The implementation also does not allow users to incorporate utilities explicitly (e.g., Lindley, 1985; for alternative approaches see also Azevedo et al., 2019 and Feit and Berman, 2019). However, again, based on the provided posterior probabilities and posterior samples, users who wish to take into account utilities may do so in a relatively straightforward way. Furthermore, users interested in adjusting the model used in abtest (e.g., to account for hierarchically-structured data or covariates) are referred to general-purpose Bayesian software such as Stan (Carpenter et al., 2017; Stan Development Team, 2019) and the related R package brms (Bürkner, 2017). In combination with the bridgesampling package (Gronau et al., 2020), this enables the user to compare custom models using Bayes factors and posterior model probabilities. A more structural limitation of abtest is that it has been developed to analyze A/B test data, but not to run the A/B test experiment itself.
In sum, A/B testing is ubiquitous in business and medicine. Here we have demonstrated how the abtest package enables relatively complete Bayesian inference including the capability to obtain support for the null, continuously monitor the results, and elicit and incorporate expert prior knowledge. Hopefully, this approach forms a basis for evidence-based conclusions that will benefit both businesses and patients.
7 Acknowledgements
This research was supported by a Netherlands Organisation for Scientific Research (NWO) grant to QFG (406.16.528) and by an NWO Vici grant to EJW (016.Vici.170.083).
Appendix A Interpretation of the parameters
Here we show that β corresponds to the grand mean of the log odds and that ψ corresponds to the log odds ratio (for the model definition, see Equation 1). The nuisance parameter β corresponds to the grand mean of the log odds since
[TABLE]
The test-relevant parameter ψ corresponds to the log odds ratio since
[TABLE]
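Both interpretations follow in two lines once the model is written out. The derivation below is a sketch that assumes Equation 1 has the usual logistic form, with $p_1$ and $p_2$ the success probabilities of the two groups.

```latex
\begin{align*}
% Assumed form of Equation 1
\log\frac{p_1}{1-p_1} &= \beta - \frac{\psi}{2}, &
\log\frac{p_2}{1-p_2} &= \beta + \frac{\psi}{2}. \\
% Adding the two equations and dividing by two
\beta &= \frac{1}{2}\left[\log\frac{p_1}{1-p_1} + \log\frac{p_2}{1-p_2}\right], \\
% Subtracting the first equation from the second
\psi &= \log\frac{p_2}{1-p_2} - \log\frac{p_1}{1-p_1}
      = \log\frac{p_2/(1-p_2)}{p_1/(1-p_1)}.
\end{align*}
```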
Appendix B Prior elicitation: implied distributions
The prior elicitation approach described in Equation 2 requires the cdf’s for the quantities of interest. Here, we derive the implied cdf’s for these quantities; we also derive the corresponding probability density functions (pdf’s). Additionally, we derive four further implied distributions of interest: the joint pdf of and , the conditional pdf of given is fixed to a particular value, the marginal distribution for , and the marginal distribution for . A few of these expressions will contain a one-dimensional integral which can easily be evaluated using numerical integration.
B.1 Log odds ratio
Since itself corresponds to the log odds ratio, corresponds in this case to the cdf of a normal distribution with mean and standard deviation . The corresponding pdf is the normal probability density function.
B.2 Odds ratio
The implied prior on the odds ratio is a log-normal distribution. Hence, corresponds in this case to the cdf of a log-normal distribution with parameters and . The corresponding pdf is the log-normal probability density function.
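The link between the normal prior on the log odds ratio and the log-normal prior on the odds ratio can be checked numerically in base R. The prior mean and standard deviation below (mu_psi, sigma_psi) are illustrative values chosen for this sketch.

```r
# Illustrative prior parameters for the log odds ratio (assumed values)
mu_psi <- 0
sigma_psi <- 1

# A few odds ratio values at which to compare the two routes
or_grid <- c(0.25, 0.5, 1, 2, 4)

# cdf: log-normal on the odds ratio vs normal on the log odds ratio
cdf_lognormal  <- plnorm(or_grid, meanlog = mu_psi, sdlog = sigma_psi)
cdf_via_normal <- pnorm(log(or_grid), mean = mu_psi, sd = sigma_psi)

# pdf: the change of variables adds the Jacobian factor 1 / odds ratio
pdf_lognormal  <- dlnorm(or_grid, meanlog = mu_psi, sdlog = sigma_psi)
pdf_via_normal <- dnorm(log(or_grid), mean = mu_psi, sd = sigma_psi) / or_grid

all.equal(cdf_lognormal, cdf_via_normal)
all.equal(pdf_lognormal, pdf_via_normal)
```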
B.3 Relative risk
The relative risk is given by . We use a capital letter (i.e., ) to refer to the random variable and use a lower-case letter (i.e., ) to refer to a concrete realization. Note that so far, we have abused notation by only using lower-case letters, but it should be clear from the context when we referred to a random variable or a concrete realization. However, for deriving the following cdf, we need the distinction to keep the notation clear. To derive the implied cdf for the relative risk, we proceed as follows:
[TABLE]
Taking reciprocals and some algebra yields
[TABLE]
When we set
[TABLE]
we can solve for using the fact that this is a quadratic equation in and we obtain:
[TABLE]
where we took into account that needs to be positive (i.e., we omitted the solution corresponding to minus the square root). Hence,
[TABLE]
Therefore, whenever
[TABLE]
Hence, the desired cdf can be written as
[TABLE]
where denotes the cdf of a normal distribution with mean and variance , and denotes the corresponding pdf.
The pdf of the relative risk is obtained by taking the derivative with respect to :
[TABLE]
B.4 Absolute risk
The absolute risk is given by . We use the upper-case letter to refer to the random variable and the lower-case letter to refer to a concrete realization. To derive the implied cdf for the absolute risk, we proceed as follows:
[TABLE]
After some algebra, we obtain
[TABLE]
When we set
[TABLE]
we can solve for using the fact that this is a quadratic equation in and we obtain:
[TABLE]
where we took into account that needs to be positive (i.e., we omitted the solution corresponding to minus the square root). Hence,
[TABLE]
Therefore, whenever
[TABLE]
Hence, the desired cdf can be written as
[TABLE]
The pdf of the absolute risk is obtained by taking the derivative with respect to :
[TABLE]
B.5 Joint distribution of p1 and p2
Another distribution of interest is the implied joint distribution of the two success probabilities p1 and p2. This distribution will not be used to elicit the prior on ψ, which is the reason why we only derive the pdf and not the cdf. The model parameters β and ψ are related to p1 and p2 as follows:
[TABLE]
Hence, the inverse transformation is given by:
[TABLE]
The corresponding Jacobian is:
[TABLE]
Therefore, the joint pdf of p1 and p2 is given by:
[TABLE]
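The change of variables behind the joint pdf can be sketched as follows. The sketch assumes the logistic parameterization of Equation 1 and independent normal priors on $\beta$ and $\psi$; the symbols $\mu_\beta, \sigma_\beta, \mu_\psi, \sigma_\psi$ for the prior means and standard deviations are notational assumptions.

```latex
\begin{align*}
% Inverse transformation
\beta &= \tfrac{1}{2}\left[\log\tfrac{p_1}{1-p_1} + \log\tfrac{p_2}{1-p_2}\right], &
\psi &= \log\tfrac{p_2}{1-p_2} - \log\tfrac{p_1}{1-p_1}. \\
% Jacobian of (beta, psi) with respect to (p1, p2)
J &= \begin{pmatrix}
  \dfrac{1}{2\,p_1(1-p_1)} & \dfrac{1}{2\,p_2(1-p_2)} \\[2ex]
  -\dfrac{1}{p_1(1-p_1)} & \dfrac{1}{p_2(1-p_2)}
\end{pmatrix}, &
|\det J| &= \frac{1}{p_1(1-p_1)\,p_2(1-p_2)}. \\
% Joint density of the success probabilities
\pi(p_1, p_2) &=
  \frac{\mathcal{N}\!\left(\beta(p_1,p_2);\, \mu_\beta, \sigma_\beta^2\right)\,
        \mathcal{N}\!\left(\psi(p_1,p_2);\, \mu_\psi, \sigma_\psi^2\right)}
       {p_1(1-p_1)\,p_2(1-p_2)}.
\end{align*}
```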
B.6 Marginal distribution of p1
The marginal distribution of p1 is given by:
[TABLE]
B.7 Marginal distribution of p2
The marginal distribution of p2 is given by:
[TABLE]
B.8 Conditional distribution of p2 given p1
Another distribution of interest is the conditional distribution of the second success probability p2 given a particular value of p1. This distribution will not be used for prior elicitation, which is the reason why we only present the expression for the pdf, which is given by:
[TABLE]
B.9 Implied distributions for truncated priors on the log odds ratio
Note that the above expressions can all be easily modified in case the prior on the log odds ratio is a truncated normal distribution (e.g., restricting ψ to be larger/smaller than zero), which is the case for the hypotheses H+ and H-. In this case, the normal prior density function and cumulative distribution function for ψ simply need to be replaced by the truncated versions. For the implied log-normal prior on the odds ratio, the truncation bounds simply need to be exponentiated to obtain the truncation bounds with respect to the log-normal prior.
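The implied distributions can also be approximated by simulation, which offers a quick sanity check on the analytic expressions. The base R sketch below assumes the logistic parameterization of Equation 1 and standard normal priors on β and ψ; truncation to positive values (as under H+) is implemented by simply discarding draws with ψ ≤ 0. The function name implied is illustrative.

```r
set.seed(1)
n_draws <- 1e5

# Draws from the assumed standard normal priors
beta <- rnorm(n_draws, mean = 0, sd = 1)   # grand mean of the log odds
psi  <- rnorm(n_draws, mean = 0, sd = 1)   # log odds ratio

# Map (beta, psi) draws to the implied quantities of interest
implied <- function(beta, psi) {
  p1 <- plogis(beta - psi / 2)
  p2 <- plogis(beta + psi / 2)
  data.frame(p1 = p1, p2 = p2,
             arisk = p2 - p1,    # absolute risk
             rrisk = p2 / p1,    # relative risk
             or    = exp(psi))   # odds ratio
}

h1 <- implied(beta, psi)               # unconstrained prior
keep <- psi > 0                        # rejection step = truncated normal prior
hplus <- implied(beta[keep], psi[keep])

quantile(h1$arisk, c(0.025, 0.5, 0.975))
quantile(hplus$arisk, c(0.025, 0.5, 0.975))
```

Under the truncated prior every draw has a positive absolute risk, a relative risk above 1, and an odds ratio above 1, as the exponentiated truncation bound implies.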
Appendix C Laplace approximation details
The Laplace approximations require first-order and second-order derivatives. Let us first state explicitly the functions for which we need to find the derivatives. For we have:
[TABLE]
For we have:
[TABLE]
For we have:
[TABLE]
Finally, for we have
[TABLE]
C.1 First-order derivatives
The first-order derivatives are used to find the modes for the Laplace approximations. As shown below, we can find these derivatives analytically; however, setting the derivatives equal to zero and solving for the parameters is not straightforward. Nevertheless, having these derivatives is useful not only as an intermediate step to finding the second-order derivatives but also for finding the modes: This allows us to provide numerical optimizers with the analytic expressions for the derivatives which can increase speed and accuracy for numerically finding the modes of the relevant functions.
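To make the role of the derivatives concrete, the following base R sketch applies a plain Laplace approximation to the one-parameter case in which both groups share a single success probability. It is an illustration under stated assumptions, not the package's implementation: the prior on the log odds is taken to be standard normal, binomial coefficients are omitted, and the counts are the fictitious 20/40 and 30/40 used elsewhere in the text.

```r
# Fictitious pooled counts: 20/40 and 30/40 successes
y <- 20 + 30
n <- 40 + 40

# Log of (likelihood kernel x assumed N(0, 1) prior) as a function of the log odds
log_f <- function(beta) {
  y * plogis(beta, log.p = TRUE) +
    (n - y) * plogis(-beta, log.p = TRUE) +
    dnorm(beta, mean = 0, sd = 1, log = TRUE)
}
# Analytic first- and second-order derivatives of log_f
grad_f <- function(beta) y - n * plogis(beta) - beta
hess_f <- function(beta) -n * plogis(beta) * (1 - plogis(beta)) - 1

# Find the mode numerically, supplying the analytic gradient to the optimizer
fit <- optim(par = 0, fn = log_f, gr = grad_f, method = "BFGS",
             control = list(fnscale = -1))
beta_mode <- fit$par

# Laplace approximation to the log of the integral of exp(log_f)
log_ml_laplace <- log_f(beta_mode) + 0.5 * log(2 * pi) -
  0.5 * log(-hess_f(beta_mode))

# Reference value via numerical integration on a rescaled integrand
sd_approx <- sqrt(-1 / hess_f(beta_mode))
integral <- integrate(function(b) exp(log_f(b) - log_f(beta_mode)),
                      lower = beta_mode - 10 * sd_approx,
                      upper = beta_mode + 10 * sd_approx)$value
log_ml_numeric <- log(integral) + log_f(beta_mode)

c(laplace = log_ml_laplace, numeric = log_ml_numeric)
```

Passing grad_f to optim avoids finite-difference gradients, which is the speed and accuracy benefit the text refers to.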
The first-order derivative for is given by:
[TABLE]
The first-order partial derivatives for are given by
[TABLE]
and
[TABLE]
The first-order partial derivatives for are given by:
[TABLE]
and
[TABLE]
The first-order partial derivatives for are given by:
[TABLE]
and
[TABLE]
C.2 Second-order derivatives
For the Laplace approximations, we also need the inverse of the negative Hessians. The Hessian is the matrix with the second-order partial derivatives which is the reason why we now present expressions for the second-order partial derivatives. Note that under all hypotheses there are either one or two parameters. Hence, the Hessians will be at most 2 by 2 matrices. For matrices up to 2 by 2, it is straightforward to find the inverse and the determinant which makes it easy to obtain the quantities needed for the Laplace approximations once we have the required derivatives.
For , there is only one parameter and the second-order derivative is given by:
[TABLE]
For the second-order partial derivatives are given by
[TABLE]
and
[TABLE]
and
[TABLE]
For the second-order partial derivatives are given by
[TABLE]
and
[TABLE]
and
[TABLE]
For the second-order partial derivatives are given by
[TABLE]
and
[TABLE]
and
[TABLE]
C.3 Hessians
Having derived the relevant second-order partial derivatives, we can simply build the Hessian matrices of interest by inserting the relevant expressions. Next, we present symbolically the Hessians of interest, that is, we show which of the second-order partial derivatives need to be inserted where. Note that we omit the one for since this is a single number which is simply the second-order derivative of .
The Hessian for is given by:
[TABLE]
The Hessian for is given by:
[TABLE]
The Hessian for is given by:
[TABLE]
C.3.1 Computing the inverse of the negative Hessians
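Note that computing the inverses of the 2 by 2 negative Hessians is straightforward: We simply need to attach minus signs to each element of the Hessians and then make use of the fact that the inverse of a 2 by 2 matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is given by $A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$, where $\det(A) = ad - bc$.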
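The closed-form inverse is easy to verify against R's general-purpose solver. In the sketch below, the function inv2x2 and the example Hessian are hypothetical and serve only to illustrate the formula.

```r
# Closed-form inverse of a 2 by 2 matrix (hypothetical helper)
inv2x2 <- function(A) {
  det_A <- A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]
  # matrix() fills column-wise: first column (d, -c), second column (-b, a)
  matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) / det_A
}

# Made-up Hessian of a log posterior at its mode (negative definite)
H <- matrix(c(-20, -3,
              -3,  -6), nrow = 2, byrow = TRUE)

neg_H <- -H                      # attach minus signs to each element
cov_laplace <- inv2x2(neg_H)     # inverse of the negative Hessian

all.equal(cov_laplace, solve(neg_H))
```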
Appendix D Example 1: effectiveness of resilience training (default analysis)
Here we present the results for the resilience training example obtained using the default prior setting.
D.1 Prior specification
We use the default prior setting in the abtest package that assigns both β (the grand mean of the log odds) and ψ (the log odds ratio) standard normal prior distributions. The implied prior on the absolute risk can be visualized as follows:
R> library("abtest")
R> plot_prior(what = "arisk")
The resulting graph is shown in the top panel of Figure 8.
The user can also visualize the (implied) prior for other quantities. For instance, the prior on the log odds ratio (middle panel of Figure 8) is obtained as follows:
R> plot_prior(what = "logor")
The implied prior on the success probabilities p1 and p2 (bottom panel of Figure 8) is obtained as follows:
R> plot_prior(what = "p1p2")
The bottom panel of Figure 8 illustrates that there is a dependency between p1 and p2, which is arguably desirable (Howard, 1998): When one of the success probabilities is very small (large), it is likely that the other one will also be small (large).
D.2 Hypothesis testing
The ab_test function can be used to conduct a Bayesian A/B test using the default prior setting as follows:
R> data("seqdata")
R> set.seed(1)
R> ab_default <- ab_test(data = seqdata)
This yields the following output:
R> print(ab_default)

Bayesian A/B Test Results:

Bayes Factors:
BF10: 0.2767214
BF+0: 0.4890489
BF-0: 0.05778357

Prior Probabilities Hypotheses:
H+: 0.25
H-: 0.25
H0: 0.5

Posterior Probabilities Hypotheses:
H+: 0.192
H-: 0.0227
H0: 0.7853
The first part of the output presents Bayes factors in favor of the hypotheses H1, H+, and H-, where the reference hypothesis (i.e., denominator of the Bayes factor) is H0. Since all three Bayes factors are smaller than 1, they all indicate evidence in favor of the null hypothesis of no effect. The next part of the output displays the prior probabilities of the hypotheses with non-zero prior probability. The final part of the output displays the posterior probabilities of the hypotheses with non-zero prior probability. The posterior probability of the null hypothesis indicates that the data have increased the plausibility of the null hypothesis from .50 to .7853. Furthermore, the data have decreased the plausibility of both H+ and H-.
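The posterior probabilities follow from the printed Bayes factors and the prior probabilities by Bayes' rule. The base R sketch below reproduces them from the values in the output; it is an illustration of the arithmetic rather than the package's internal code, and the variable names are illustrative.

```r
# Bayes factors relative to H0, copied from the printed output
bf <- c("H+" = 0.4890489, "H-" = 0.05778357, "H0" = 1)

# Default prior probabilities of the hypotheses
prior_prob <- c("H+" = 0.25, "H-" = 0.25, "H0" = 0.50)

# Posterior probability is proportional to prior probability x Bayes factor
post_prob <- prior_prob * bf / sum(prior_prob * bf)

round(post_prob, 4)
```

The rounded values agree with the posterior probabilities printed by print(ab_default).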
The abtest package allows users to visualize the posterior probabilities of the hypotheses by means of a probability wheel (Figure 9):
R> prob_wheel(ab_default)
Overall, the data support the hypothesis that the training is ineffective over the hypothesis that the training has a positive effect. The Bayes factor for H0 over H+ equals 1/0.4890489 ≈ 2.04; however, this indicates only anecdotal evidence (Jeffreys, 1939, Appendix I).
Since the data set is of a sequential nature, it may be of interest to consider not only the result based on all observations, but to conduct also a sequential analysis that tracks the evidential flow as a function of the total number of observations (i.e., the number of observations across both groups). This sequential analysis can be conducted as follows:
R> plot_sequential(ab_default, thin = 4)
Figure 10 displays the result of the sequential analysis. The sequential analysis indicates that after some initial fluctuation, adding more observations increased the probability of the null hypothesis that there is no effect of the training.
D.3 Parameter estimation
The data indicate only anecdotal evidence in favor of the null hypothesis versus the hypothesis that the training is effective, leaving open the possibility that the training does have an effect. To assess this possibility one may investigate the potential size of the effect under the assumption that the effect is non-zero. For parameter estimation, we generally prefer to investigate the posterior distribution for the unconstrained alternative hypothesis H1.
The top panel of Figure 11 displays the posterior distribution for the absolute risk (i.e., p2 − p1) that can be obtained as follows:
R> plot_posterior(ab_default, what = "arisk")
The top panel of Figure 11 shows the prior distribution as a dotted line and the posterior distribution (with 95% central credible interval) as a solid line. The plot indicates that, under the assumption that the difference between the two success probabilities is not exactly zero, the posterior median is and the 95% central credible interval ranges from to .
The middle panel of Figure 11 displays the posterior distribution for the log odds ratio that can be obtained as follows:
R> plot_posterior(ab_default, what = "logor")
The middle panel of Figure 11 indicates that, given the log odds ratio is not exactly zero, it is likely to be between and , where the posterior median is .
It may also be of interest to consider the marginal posterior distributions of the success probabilities p1 and p2. This plot can be produced as follows:
R> plot_posterior(ab_default, what = "p1p2")
The bottom panel of Figure 11 displays the resulting plot. In this example, p1 and p2 correspond to the probability of still being on the job after six months for the non-trained employees and the employees that received the training, respectively. The bottom panel of Figure 11 reports the posterior median and the 95% credible interval for each of p1 and p2.
In sum, based on a default prior analysis, this fictitious data set offers anecdotal evidence in favor of the null hypothesis, which states that the training is not effective, over the hypothesis that the training is effective; the consultancy firm should probably continue to collect data in order to obtain more compelling evidence before deciding whether or not the training should be implemented. If the true effect is as small as 4%, continued testing will ultimately show compelling evidence for H+ over H0. Note that continued testing is trivial in the Bayesian framework: the results can simply be updated as new observations arrive.
References
- Armitage P (1960). Sequential Medical Trials. Thomas, Springfield (IL).
- Azevedo EM, Alex D, Montiel Olea J, Rao JM, Weyl EG (2019). “A/B Testing with Fat Tails.” SSRN. URL http://dx.doi.org/10.2139/ssrn.3171224.
- Bååth R (2014). “Bayesian First Aid: A Package that Implements Bayesian Alternatives to the Classical *.test Functions in R.” In Use R! 2014 - the International R User Conference.
- Bartlett MS (1957). “A Comment on D. V. Lindley’s Statistical Paradox.” Biometrika, 44, 533–534.
- Berger JO, Delampady M (1987). “Testing Precise Hypotheses.” Statistical Science, 2, 317–352.
- Berger JO, Wolpert RL (1988). The Likelihood Principle (2nd ed.). Institute of Mathematical Statistics, Hayward (CA).
- Berman R, Pekelis L, Scott A, Van den Bulte C (2018). “p-Hacking and False Discovery in A/B Testing.” SSRN. URL http://dx.doi.org/10.2139/ssrn.3204791.
- Bürkner PC (2017). “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software, 80, 1–28.
