Informed Bayesian Inference for the A/B Test

Quentin F. Gronau; K. N. Akash Raj; and Eric-Jan Wagenmakers

arXiv:1905.02068·stat.AP·November 16, 2020·J. Stat. Softw.


TL;DR

This paper introduces a Bayesian A/B testing method that allows for continuous evidence monitoring, incorporation of expert prior knowledge, and assessment of null effects, addressing limitations of existing approaches.

Contribution

It presents a Bayesian A/B testing procedure based on Kass and Vaidyanathan (1992) that supports evidence monitoring, null hypothesis evaluation, and prior knowledge integration.

Findings

1. Supports evidence monitoring during data collection
2. Allows explicit evaluation of the null hypothesis
3. Incorporates expert prior knowledge

Abstract

Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition. Across many practical applications, it is desirable that (1) evidence can be obtained in favor of the null hypothesis that the treatment is ineffective; (2) evidence can be monitored as the data accumulate; (3) expert prior knowledge can be taken into account. Most existing approaches do not fulfill these desiderata. Here we describe a Bayesian A/B procedure based on Kass and Vaidyanathan (1992) that allows one to monitor the evidence for the hypotheses that the treatment has either a positive effect, a negative effect, or, crucially, no effect. Furthermore, this approach enables one to incorporate expert knowledge about the relative prior plausibility of the rival hypotheses and about the expected…

Tables

Table 1: Changing the prior probability assignments across rival hypotheses produces different tests. Entries are prior probabilities.

	Test
Hypothesis	Default	Undirected	Positive	Negative	Direction
$\mathcal{H}_{0}$	.50	.50	.50	.50	0
$\mathcal{H}_{1}$	0	.50	0	0	0
$\mathcal{H}_{+}$	.25	0	.50	0	.50
$\mathcal{H}_{-}$	.25	0	0	.50	.50

Equations

$$\begin{aligned}
\log\left(\frac{p_{1}}{1-p_{1}}\right) &= \beta-\frac{\psi}{2}, \\
\log\left(\frac{p_{2}}{1-p_{2}}\right) &= \beta+\frac{\psi}{2}, \\
y_{1} &\sim \text{Binomial}(n_{1},p_{1}), \\
y_{2} &\sim \text{Binomial}(n_{2},p_{2}).
\end{aligned}$$

$$(\mu_{\psi},\sigma_{\psi})=\underset{\mu_{\psi},\,\sigma_{\psi}}{\arg\min}\;\sum_{i=1}^{I}\left(F(q_{i};\mu_{\psi},\sigma_{\psi})-\text{prob}_{i}\right)^{2},$$

$$\underbrace{p(\mathcal{H}_{j}\mid\text{data})}_{\text{posterior probability}}=\underbrace{\frac{p(\text{data}\mid\mathcal{H}_{j})}{\sum_{k}p(\text{data}\mid\mathcal{H}_{k})\,p(\mathcal{H}_{k})}}_{\text{updating factor}}\times\underbrace{p(\mathcal{H}_{j})}_{\text{prior probability}}.$$

$$\underbrace{\frac{p(\mathcal{H}_{j}\mid\text{data})}{p(\mathcal{H}_{k}\mid\text{data})}}_{\text{posterior odds}}=\underbrace{\frac{p(\text{data}\mid\mathcal{H}_{j})}{p(\text{data}\mid\mathcal{H}_{k})}}_{\text{Bayes factor }\text{BF}_{jk}}\times\underbrace{\frac{p(\mathcal{H}_{j})}{p(\mathcal{H}_{k})}}_{\text{prior odds}}.$$

$$p(\text{data}\mid\mathcal{H}_{0})=\int\underbrace{p(\text{data}\mid\beta)}_{\text{likelihood}}\,\underbrace{\pi_{0}(\beta)}_{\text{prior}}\,\text{d}\beta\approx(2\pi\sigma_{0}^{2})^{\frac{1}{2}}\exp\left\{l_{0}^{*}(\beta_{0}^{*})\right\},$$

$$p(\text{data}\mid\mathcal{H}_{1})=\int\int\underbrace{p(\text{data}\mid\beta,\psi)}_{\text{likelihood}}\,\underbrace{\pi(\beta,\psi)}_{\text{prior}}\,\text{d}\beta\,\text{d}\psi\approx 2\pi\det(\Sigma_{1})^{\frac{1}{2}}\exp\left\{l^{*}(\beta^{*},\psi^{*})\right\},$$

$$p(\text{data}\mid\mathcal{H}_{+})=\int\int\underbrace{p(\text{data}\mid\beta,\xi)}_{\text{likelihood}}\,\underbrace{\pi_{+}(\beta,\xi)}_{\text{prior}}\,\text{d}\beta\,\text{d}\xi\approx\frac{1}{S}\sum_{s=1}^{S}\frac{p(\text{data}\mid\tilde{\beta}_{s},\tilde{\xi}_{s})\,\pi_{+}(\tilde{\beta}_{s},\tilde{\xi}_{s})}{g_{\text{is}}(\tilde{\beta}_{s},\tilde{\xi}_{s})},$$

$$\pi_{+}(\beta,\xi)=\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\mathcal{N}_{+}(\exp(\xi);\mu_{\psi},\sigma_{\psi}^{2})\,\exp(\xi),$$

$$w_{s}=\frac{p(\text{data}\mid\tilde{\beta}_{s},\tilde{\xi}_{s})\,\pi_{+}(\tilde{\beta}_{s},\tilde{\xi}_{s})}{g_{\text{is}}(\tilde{\beta}_{s},\tilde{\xi}_{s})},\quad s=1,2,\dots,S.$$

$$\frac{1}{2}\log\left(\frac{p_{1}}{1-p_{1}}\right)+\frac{1}{2}\log\left(\frac{p_{2}}{1-p_{2}}\right)=\frac{1}{2}\beta-\frac{1}{4}\psi+\frac{1}{2}\beta+\frac{1}{4}\psi=\beta.$$

$$\log\left(\frac{\frac{p_{2}}{1-p_{2}}}{\frac{p_{1}}{1-p_{1}}}\right)=\log\left(\frac{p_{2}}{1-p_{2}}\right)-\log\left(\frac{p_{1}}{1-p_{1}}\right)=\beta+\frac{\psi}{2}-\left(\beta-\frac{\psi}{2}\right)=\psi.$$

$$P(\Lambda\leq\lambda)=P(p_{2}\leq\lambda\,p_{1})=P\left(\frac{1}{1+\exp\left(-\beta-\frac{\psi}{2}\right)}\leq\frac{\lambda}{1+\exp\left(-\beta+\frac{\psi}{2}\right)}\right).$$

$$P\left(\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}+(1-\lambda)\exp(\beta)\exp\left(\frac{\psi}{2}\right)-\lambda\leq 0\right).$$

$$\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}+(1-\lambda)\exp(\beta)\exp\left(\frac{\psi}{2}\right)-\lambda=0,$$

$$\exp\left(\frac{\psi}{2}\right)=\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2},$$

$$\psi=2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right).$$

$$\psi\leq 2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right).$$

$$\begin{aligned}
&P\left(\psi\leq 2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right)\right) \\
&\quad=\int_{-\infty}^{\infty}\int_{-\infty}^{2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right)}\mathcal{N}(\psi;\mu_{\psi},\sigma_{\psi}^{2})\,\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\text{d}\psi\,\text{d}\beta \\
&\quad=\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\Phi\left(2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\text{d}\beta,
\end{aligned}$$

$$\begin{aligned}
\frac{d}{d\lambda}&\left[\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\Phi\left(2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\text{d}\beta\right] \\
&=\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\mathcal{N}\left(2\log\left(\frac{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}{2}\right);\mu_{\psi},\sigma_{\psi}^{2}\right) \\
&\qquad\times 2\left[\frac{\exp(\beta)+\frac{2-(1-\lambda)\exp(2\beta)}{\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}}{-(1-\lambda)\exp(\beta)+\sqrt{(1-\lambda)^{2}\exp(2\beta)+4\lambda}}\right]\text{d}\beta.
\end{aligned}$$

$$P(\Upsilon\leq\upsilon)=P(p_{2}\leq\upsilon+p_{1})=P\left(\frac{1}{1+\exp\left(-\beta-\frac{\psi}{2}\right)}\leq\upsilon+\frac{1}{1+\exp\left(-\beta+\frac{\psi}{2}\right)}\right).$$

$$P\left(\exp(\beta)(1-\upsilon)\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}-\upsilon\left(\exp(2\beta)+1\right)\exp\left(\frac{\psi}{2}\right)-\exp(\beta)(\upsilon+1)\leq 0\right).$$

$$\exp(\beta)(1-\upsilon)\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}-\upsilon\left(\exp(2\beta)+1\right)\exp\left(\frac{\psi}{2}\right)-\exp(\beta)(\upsilon+1)=0,$$

$$\exp\left(\frac{\psi}{2}\right)=\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)},$$

$$\psi=2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right).$$

$$\psi\leq 2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right).$$

$$\begin{aligned}
&P\left(\psi\leq 2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right)\right) \\
&\quad=\int_{-\infty}^{\infty}\int_{-\infty}^{2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right)}\mathcal{N}(\psi;\mu_{\psi},\sigma_{\psi}^{2})\,\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\text{d}\psi\,\text{d}\beta \\
&\quad=\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\Phi\left(2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\text{d}\beta.
\end{aligned}$$

$$\begin{aligned}
\frac{d}{d\upsilon}&\left[\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\Phi\left(2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\text{d}\beta\right] \\
&=\int_{-\infty}^{\infty}\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\mathcal{N}\left(2\log\left(\frac{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}{2\exp(\beta)(1-\upsilon)}\right);\mu_{\psi},\sigma_{\psi}^{2}\right) \\
&\qquad\times 2\left[\frac{\exp(2\beta)+\frac{\upsilon\left(\exp(2\beta)-1\right)^{2}}{\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}+1}{\upsilon\left(\exp(2\beta)+1\right)+\sqrt{\upsilon^{2}\left(\exp(2\beta)-1\right)^{2}+4\exp(2\beta)}}+\frac{1}{1-\upsilon}\right]\text{d}\beta.
\end{aligned}$$

$$\log\left(\frac{p_{1}}{1-p_{1}}\right)$$


Full text

Quentin F. Gronau (University of Amsterdam), Akash Raj K. N. (University of Amsterdam), and Eric-Jan Wagenmakers (University of Amsterdam)

Abstract

Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition. Across many practical applications, it is desirable that (1) evidence can be obtained in favor of the null hypothesis that the treatment is ineffective; (2) evidence can be monitored as the data accumulate; (3) expert prior knowledge can be taken into account. Most existing approaches do not fulfill these desiderata. Here we describe a Bayesian A/B procedure based on Kass and Vaidyanathan (1992) that allows one to monitor the evidence for the hypotheses that the treatment has either a positive effect, a negative effect, or, crucially, no effect. Furthermore, this approach enables one to incorporate expert knowledge about the relative prior plausibility of the rival hypotheses and about the expected size of the effect, given that it is non-zero. To facilitate the wider adoption of this Bayesian procedure we developed the abtest package in R. We illustrate the package options and the associated statistical results with a fictitious business example and a real data medical example.

Keywords: model comparison, Bayes factor, prior elicitation, Bayesian estimation

Address:

Quentin F. Gronau

Department of Psychological Methods

University of Amsterdam

Nieuwe Achtergracht 129 B

1018 WT Amsterdam, The Netherlands

1 Introduction

Does the modification of a company website increase the number of online purchases? Does a new drug result in a lower mortality rate? These are just two examples of the kinds of questions that can be addressed with A/B testing, a procedure popular not only in business and medical clinical trials, but also in fields such as psychology, neuroscience, and biology. The A/B test set-up discussed in this article assumes that the outcome variable is binary; nevertheless, the outcome variable could in principle also be continuous. Based on a binary outcome variable, an A/B test compares the success rate of two options or treatment arms, A and B, and therefore can be conceptualized as a test for a difference between two proportions (Little, 1989). Typically, options A and B correspond to a control condition and an intervention or treatment of interest.

Regardless of the specific field of application, we believe three general desiderata for A/B tests can be identified. First, we believe it is desirable that evidence can be obtained in favor of the null hypothesis that there is no difference between options A and B. For instance, suppose a programmer alters code that should leave the appearance of a website unaffected. An A/B test may be conducted to confirm that the code changes did not lead to unintended consequences. Alternatively, suppose that a cheaper drug is introduced as a replacement of the standard drug; here, an A/B test may confirm that the cheaper drug is as effective as the drug that is currently standard.

Second, we believe it is desirable that evidence can be monitored as the data accumulate. Data collection can be time-consuming and expensive, and interim tests allow one to assess whether the results in hand are already sufficiently compelling or whether additional data ought to be obtained. There is also an ethical aspect to this desideratum, one that is particularly pronounced in case of new clinical treatments that are potentially beneficial or harmful; it is unethical to withhold treatment that interim analysis shows to be beneficial, just as it is unethical to continue to administer a treatment that interim analysis shows to be harmful (e.g., Armitage, 1960; see also Ware, 1989 and the accompanying discussion).

Third, we believe it is desirable that expert knowledge can be taken into account (e.g., O’Hagan, 2019). In many A/B testing applications, there exists considerable expert knowledge about what size of effect to expect. For instance, the effect of website changes on conversion rates is often less than 0.5% (Berman et al., 2018). Incorporating such expert knowledge into the statistical analysis will yield a more targeted test.

The majority of A/B testing procedures that are currently in vogue do not fulfill the above desiderata. Specifically, many companies apply standard $p$-value-based null hypothesis significance testing to assess whether or not options A and B differ. This procedure has the advantage that it is readily available in software such as R (R Core Team, 2019; e.g., via the functions prop.test, fisher.test, and chisq.test). However, this approach cannot distinguish between absence of evidence (i.e., the data are inconclusive) and evidence of absence (i.e., the data provide support for the null hypothesis that options A and B do not differ; e.g., Dienes, 2014; Keysers et al., 2020). Furthermore, although common practice, sequentially monitoring the uncorrected $p$-value (and stopping data collection as soon as the $p$-value is smaller than some fixed $\alpha$-level) invalidates the analysis (e.g., Feller, 1940). There do exist valid classical sequential procedures that enable one to monitor a corrected $p$-value as data accumulate (e.g., Malek et al., 2017). For instance, Optimizely, one of the leading commercial A/B testing platforms, has recently implemented an alternative $p$-value-based approach that allows users to continuously monitor the test outcome (Johari et al., 2017). Nevertheless, these sequential $p$-value-based procedures retain the inability to quantify evidence for the absence of an effect. Furthermore, (sequential) $p$-value-based A/B testing does not allow one to incorporate expert knowledge into the statistical analysis in a straightforward manner.

An alternative A/B testing approach that has become more popular of late is Bayesian estimation. For instance, VWO, another leading A/B testing platform, has recently implemented a Bayesian estimation approach (Stucchio, 2015). A Bayesian estimation approach is also available via the BayesianFirstAid package (Bååth, 2014) and the bayesAB package (Portman, 2019).[1] Since Bayesian inference does not require sample sizes to be fixed a priori (Berger and Wolpert, 1988), this approach allows one to monitor the analysis output as data accumulate. A Bayesian estimation approach also enables the incorporation of expert knowledge via the specification of a prior distribution that captures the expert’s knowledge about a parameter of interest. However, this approach operates under the assumption that an effect exists (since a continuous prior assigns zero probability to a single null value) and consequently does not allow one to obtain evidence in favor of the null hypothesis of no effect. For instance, bayesAB and BayesianFirstAid provide the user with the posterior probability that one option yields more successes than the other, but this ignores the fact that both options could be equally effective. Furthermore, the currently used Bayesian estimation approaches, such as the one implemented in bayesAB and BayesianFirstAid, typically assign independent priors to the success probabilities of the control and treatment condition, a practice that was critiqued by Howard (1998).[2]

[1] The bayesAB package provides a range of functions for Bayesian A/B testing. One advantage is that users can choose from a range of different data distributions (e.g., Bernoulli, normal, Poisson, etc.).

[2] “do English or Scots cattle have a higher proportion of cows infected with a certain virus? Suppose we were informed (before collecting any data) that the proportion of English cows infected was $0.8$. With independent uniform priors we would now give $H_{1}$ ($p_{1}>p_{2}$) a probability of $0.8$ (because the chance that $p_{2}>0.8$ is still $0.2$). In very many cases this would not be appropriate. Often we will believe (for example) that if $p_{1}$ is 80%, $p_{2}$ will be near 80% as well and will be almost equally likely to be larger or smaller.” (p. 363)

To overcome the limitations of the current A/B tests we developed the abtest package in R (R Core Team, 2019). The abtest package implements one form of Bayesian inference for the A/B test, using informed prior distributions that induce a dependency between the two success probabilities. The analysis approach is based on a model by Kass and Vaidyanathan (1992); for alternative approaches see Deng et al. (2016), Jamil et al. (2017), Pham-Gia et al. (2017), and Skorski (2019). The implemented Bayesian procedure allows users (1) to obtain evidence in favor of the null hypothesis (e.g., Berger and Delampady, 1987; Wagenmakers et al., 2018); (2) to monitor the evidence as the data accumulate (e.g., Rouder, 2014); and (3) to elicit and incorporate expert prior knowledge (e.g., O’Hagan, 2019). The abtest package thus fulfills all three desiderata mentioned above.

The abtest package provides functionality for both hypothesis testing and parameter estimation. In line with Jeffreys (1939) and Fisher (1928), we believe that testing and estimation are complementary activities (Haaf et al., 2019): before a parameter is estimated, it should be tested whether there is anything to justify estimation at all. Jeffreys (1939, p. 345) related this principle to Occam’s razor: “variation must be taken as random until there is positive evidence to the contrary” (see also Kass and Raftery, 1995, Section 8.1). However, some researchers and practitioners oppose this idea, for instance because they believe that one should replace hypothesis testing with parameter estimation (e.g., Gelman and Rubin, 1995; Cumming, 2014). Nevertheless, the abtest package may also be useful for researchers without an interest in hypothesis testing, since the package can also be used exclusively for Bayesian parameter estimation (and prior elicitation).

This article is organized as follows: The next section introduces a fictitious business example. Afterwards, the implementation details of the Bayesian A/B test procedure used in abtest are discussed. Subsequently, the fictitious example is continued and the functionality of the abtest package and the practical benefits of the implemented approach are demonstrated. Next, a real data medical example is used to demonstrate further functionality of the package. The article ends with concluding comments.

2 Example 1: effectiveness of resilience training

Suppose the managers of a large consultancy firm are interested in reducing the number of employees who quit within the first six months, possibly due to the high stress involved in the job. A coaching company offers a resilience training and claims that this training greatly reduces the number of employees who quit. Implementing the training for all newly hired employees would be expensive and some of the managers are not completely convinced that the training is at all effective. Therefore, the managers decide to run an A/B test where half of a sample of newly hired employees will receive the training, the other half will not be trained. The outcome variable is whether or not an employee quit within the first six months (1 = still on the job, 0 = quit).

The consultancy firm collects $1,000$ observations ($500$ in each group). These (fictitious) data[3] are included in the abtest package (i.e., seqdata). The number of employees still on the job after six months is $249$ in the group without training and $269$ in the trained group. Figure 1 provides an illustration of some of the information that can be obtained by analyzing these data using abtest. The figure displays the probability of the hypothesis that the training has a positive effect (i.e., $\mathcal{H}_{+}$), negative effect (i.e., $\mathcal{H}_{-}$), and no effect (i.e., $\mathcal{H}_{0}$) as a function of the number of observations across the two groups. The top part of the figure displays the probability of the three hypotheses before and after taking into account the observed data (i.e., prior and posterior probabilities) as probability wheels (e.g., Tversky, 1969; Lipkus and Hollands, 1999). Before providing more details about how to obtain and interpret this result as well as providing additional analyses, we discuss the implementation details of the A/B test procedure used by abtest.

[3] The data set is structured such that the sequential nature of the data is retained: the data set contains the number of observations and the number of successes in each of the two groups after each observation.

3 Implementation details

The Bayesian A/B test implemented in the abtest package is based on Kass and Vaidyanathan (1992, Section 3, “Testing Equality of Two Binomial Proportions”). Appendices A-C provide detailed derivations.

3.1 Model

Let $y_{1}$ denote the number of successes for option A with $n_{1}$ denoting the corresponding total number of observations for option A. Similarly, $y_{2}$ denotes the number of successes for option B with $n_{2}$ denoting the corresponding total number of observations for option B. The Bayesian A/B test model based on Kass and Vaidyanathan (1992) is specified as follows:[4]

$$\begin{aligned}
\log\left(\frac{p_{1}}{1-p_{1}}\right) &= \beta-\frac{\psi}{2}, \\
\log\left(\frac{p_{2}}{1-p_{2}}\right) &= \beta+\frac{\psi}{2}, \\
y_{1} &\sim \text{Binomial}(n_{1},p_{1}), \\
y_{2} &\sim \text{Binomial}(n_{2},p_{2}).
\end{aligned}$$

[4] Note that this is equivalent to a logistic regression model with a binary covariate (i.e., group membership) that is coded using $\pm 0.5$.

Therefore, the model assumes that $y_{1}$ and $y_{2}$ follow binomial distributions with success probabilities $p_{1}$ and $p_{2}$ . These probabilities are functions of the two model parameters, $\beta$ and $\psi$ . Specifically, the log odds corresponding to $p_{1}$ are given by $\beta-\psi/2$ and the log odds corresponding to $p_{2}$ are given by $\beta+\psi/2$ . The nuisance parameter $\beta$ corresponds to the grand mean of the log odds and the test-relevant parameter $\psi$ corresponds to the log odds ratio. When $\psi$ is positive, this implies that $p_{2}>p_{1}$ (i.e., option B has a higher success probability than option A); when $\psi$ is negative this implies that $p_{2}<p_{1}$ (i.e., option B has a lower success probability than option A).
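The mapping between the parameters $(\beta,\psi)$ and the success probabilities can be checked numerically. The following is a minimal Python sketch, not code from the abtest package; the function names `success_probabilities` and `logit` and the example values are illustrative assumptions.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def success_probabilities(beta, psi):
    """Return (p1, p2) implied by the grand mean beta and log odds ratio psi."""
    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))
    # log odds of p1 is beta - psi/2; log odds of p2 is beta + psi/2
    return inv_logit(beta - psi / 2.0), inv_logit(beta + psi / 2.0)

# Illustrative values (assumed, not taken from the article)
p1, p2 = success_probabilities(beta=0.0, psi=0.5)

# The log odds ratio recovers psi, and the mean of the log odds recovers beta
log_odds_ratio = logit(p2) - logit(p1)
grand_mean = 0.5 * (logit(p1) + logit(p2))
print(p1, p2, log_odds_ratio, grand_mean)
```

A positive `psi` yields `p2 > p1`, and `psi = 0` yields identical success probabilities, matching the verbal description above.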

3.2 Hypotheses

The abtest package enables both estimation of the model parameters and testing of hypotheses about the test-relevant log odds ratio parameter $\psi$. There are four hypotheses that are of potential interest:

1. The null hypothesis $\mathcal{H}_{0}$ which states that the success probabilities $p_{1}$ and $p_{2}$ are identical, that is, $p_{1}=p_{2}$. This is equivalent to $\mathcal{H}_{0}:\psi=0$. This hypothesis corresponds to the claim that there is no difference between options A and B (i.e., the “A/A test”).

2. The two-sided alternative hypothesis $\mathcal{H}_{1}$ which states that the two success probabilities $p_{1}$ and $p_{2}$ are not equal (i.e., $p_{1}\neq p_{2}$), but does not specify which of the two is larger. This is equivalent to $\mathcal{H}_{1}:\psi\neq 0$. This hypothesis corresponds to the claim that options A and B differ but it is not specified which one yields more successes.

3. The one-sided hypothesis $\mathcal{H}_{+}$ which states that the second success probability $p_{2}$ is larger than the first success probability $p_{1}$. This is equivalent to $\mathcal{H}_{+}:\psi>0$. This hypothesis corresponds to the claim that option B yields more successes than option A.

4. The one-sided hypothesis $\mathcal{H}_{-}$ which states that the first success probability $p_{1}$ is larger than the second success probability $p_{2}$. This is equivalent to $\mathcal{H}_{-}:\psi<0$. This hypothesis corresponds to the claim that option A yields more successes than option B.

Researchers who conduct an A/B test are usually interested in answering the question: Does option B yield more successes than option A (i.e., $\mathcal{H}_{+}$), fewer successes than option A (i.e., $\mathcal{H}_{-}$), or is there no difference between options A and B (i.e., $\mathcal{H}_{0}$)? Therefore, it may be argued that the hypotheses of interest are typically $\mathcal{H}_{+}$, $\mathcal{H}_{-}$, and $\mathcal{H}_{0}$. Consequently, by default, only these three hypotheses are assigned non-zero prior probability in the abtest package. Specifically, a default prior probability of $.50$ is assigned to the hypothesis that there is no effect (i.e., $\mathcal{H}_{0}$), and the remaining prior probability is split evenly across the hypothesis that there is a positive effect (i.e., $\mathcal{H}_{+}$ receives $.25$) and a negative effect (i.e., $\mathcal{H}_{-}$ also receives $.25$). The user may change these default prior probabilities to custom values.

Table 1 provides an overview of five qualitatively different tests that can be conducted by assigning prior probabilities to hypotheses in certain ways.[5] The first column displays the default setting that assigns probability $.50$ to the null hypothesis and splits the remaining probability evenly across $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$. The second column displays a prior probability assignment that implements an undirected test (i.e., $\mathcal{H}_{0}$ is compared to the undirected $\mathcal{H}_{1}$). The third column displays a prior probability assignment for testing whether the effect is non-existent or positive. The fourth column displays a prior probability assignment for testing whether the effect is non-existent or negative. Finally, the fifth column displays a prior probability assignment for a test of direction, that is, for testing whether the effect is positive or negative. This last setting may be of interest whenever the null hypothesis is a priori deemed implausible, uninteresting, or irrelevant.

[5] Note that, except for the first column of Table 1 which displays the default setting, the remaining examples use equal prior probabilities for all hypotheses that are assigned non-zero prior probability. However, the user can of course also assign prior probability unevenly to the hypotheses of interest (e.g., if prior knowledge exists about the relative plausibility of the rival hypotheses).
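To make the role of these prior probability assignments concrete, the sketch below applies Bayes' rule across hypotheses for each column of Table 1. It is a plain Python illustration, not the abtest interface: the dictionary `PRIOR_SETTINGS`, the function `posterior_probabilities`, and the marginal likelihood values in `example_ml` are hypothetical and chosen only for demonstration.

```python
# Prior probability assignments from Table 1 (one entry per test)
PRIOR_SETTINGS = {
    "default":    {"H0": 0.50, "H1": 0.00, "H+": 0.25, "H-": 0.25},
    "undirected": {"H0": 0.50, "H1": 0.50, "H+": 0.00, "H-": 0.00},
    "positive":   {"H0": 0.50, "H1": 0.00, "H+": 0.50, "H-": 0.00},
    "negative":   {"H0": 0.50, "H1": 0.00, "H+": 0.00, "H-": 0.50},
    "direction":  {"H0": 0.00, "H1": 0.00, "H+": 0.50, "H-": 0.50},
}

def posterior_probabilities(marginal_likelihoods, prior_probs):
    """Posterior probability of each hypothesis: p(data | H) p(H) / sum over hypotheses."""
    evidence = sum(marginal_likelihoods[h] * p for h, p in prior_probs.items())
    return {h: marginal_likelihoods[h] * p / evidence for h, p in prior_probs.items()}

# Hypothetical marginal likelihoods p(data | H), made up for illustration
example_ml = {"H0": 2.0, "H1": 1.0, "H+": 1.6, "H-": 0.4}

post = posterior_probabilities(example_ml, PRIOR_SETTINGS["default"])

# Posterior odds equal the Bayes factor times the prior odds
bf_plus_0 = example_ml["H+"] / example_ml["H0"]
prior_odds = PRIOR_SETTINGS["default"]["H+"] / PRIOR_SETTINGS["default"]["H0"]
posterior_odds = post["H+"] / post["H0"]

for test, priors in PRIOR_SETTINGS.items():
    print(test, posterior_probabilities(example_ml, priors))
```

Hypotheses with zero prior probability keep zero posterior probability, so each column of Table 1 effectively restricts the comparison to the hypotheses of interest.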

3.3 Parameter priors

The abtest package assigns normal priors to the model parameters: $\beta\sim\mathcal{N}(\mu_{\beta},\sigma_{\beta}^{2})$ and $\psi\sim\mathcal{N}(\mu_{\psi},\sigma_{\psi}^{2})$. As illustrated in the example below, these priors result in a dependency in the implied prior for the success probabilities $p_{1}$ and $p_{2}$, which is generally desirable (Howard, 1998).
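The induced dependency can be seen by simulation. The sketch below draws $\beta$ and $\psi$ from independent normal priors, transforms each draw to $(p_{1},p_{2})$, and computes the sample correlation. It is a Monte Carlo illustration in Python under assumed settings (standard normal priors, 20,000 draws, a fixed seed); the helper names are hypothetical and this is not how the abtest package produces its prior plots.

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_implied_prior(n_draws, mu_beta=0.0, sigma_beta=1.0,
                       mu_psi=0.0, sigma_psi=1.0, seed=2020):
    """Draw (p1, p2) pairs implied by independent normal priors on beta and psi."""
    rng = random.Random(seed)
    p1_draws, p2_draws = [], []
    for _ in range(n_draws):
        beta = rng.gauss(mu_beta, sigma_beta)
        psi = rng.gauss(mu_psi, sigma_psi)
        p1_draws.append(inv_logit(beta - psi / 2.0))
        p2_draws.append(inv_logit(beta + psi / 2.0))
    return p1_draws, p2_draws

def pearson_correlation(xs, ys):
    """Sample Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

p1_draws, p2_draws = draw_implied_prior(20000)
prior_correlation = pearson_correlation(p1_draws, p2_draws)
print("correlation between p1 and p2 under the prior:", prior_correlation)
```

Because both success probabilities share the parameter $\beta$, the draws are positively correlated; independent priors on $p_{1}$ and $p_{2}$ would instead give a correlation near zero.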

For the one-sided hypotheses $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$ , the prior on $\psi$ is truncated at zero. Specifically, for $\mathcal{H}_{+}$ , the prior on $\psi$ is a truncated normal distribution with parameters $\mu_{\psi}$ and $\sigma_{\psi}$ and lower bound at zero. For $\mathcal{H}_{-}$ , the prior on $\psi$ is a truncated normal distribution with parameters $\mu_{\psi}$ and $\sigma_{\psi}$ and upper bound at zero. These normal priors are computationally convenient and sufficiently flexible to encode a wide range of prior information.

By default, the abtest package assigns standard normal priors to both $\beta$ and $\psi$. For the nuisance parameter $\beta$, a standard normal prior results in a relatively flat implied prior on $p_{1}$ and $p_{2}$ when $\psi=0$. Generally, the choice of a prior for the nuisance parameter $\beta$ is relatively inconsequential (Kass and Vaidyanathan, 1992). In contrast, the prior on the test-relevant parameter $\psi$ is consequential, as it defines the extent to which the hypotheses of interest differ from $\mathcal{H}_{0}$. Our choice for a default standard normal prior on the test-relevant parameter $\psi$ is motivated by the fact that a zero-centered prior does not favor either of the two options A or B a priori. Furthermore, the standard deviation of 1 results in a prior distribution that assigns mass to a wide range of reasonable log odds ratios (Chen et al., 2010) without being so uninformative that the results unduly favor $\mathcal{H}_{0}$ (Bartlett, 1957; Lindley, 1957). (Note that the default implied prior on the absolute risk $p_{2}-p_{1}$ is considerably narrower than the prior induced by the popular default choice that assigns $p_{1}$ and $p_{2}$ independent uniform distributions; Jeffreys, 1935.) However, large changes in the prior standard deviation of the test-relevant parameter may result in large changes in the results, as the prior standard deviation governs the degree to which the hypothesis of interest makes predictions that differ from $\mathcal{H}_{0}$. To include prior knowledge about the expected results, the abtest package allows the user to change the default values of the prior distributions for the nuisance parameter $\beta$ and the test-relevant parameter $\psi$, by changing the location of the normal prior distribution, the scale, or both.
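The implied prior on the success probabilities can be explored by simulation. The following base-R sketch draws from the default standard normal priors and transforms the draws, assuming the parameterization of Appendix A in which the log odds equal $\beta-\psi/2$ in condition A and $\beta+\psi/2$ in condition B. All variable names are illustrative and the code does not call the abtest package.

```r
# Monte Carlo sketch of the prior on (p1, p2) implied by normal priors on
# beta and psi. Assumes logit(p1) = beta - psi/2 and logit(p2) = beta + psi/2.
set.seed(1)
n_draws <- 1e5
mu_beta <- 0; sigma_beta <- 1   # default prior on the nuisance parameter
mu_psi  <- 0; sigma_psi  <- 1   # default prior on the log odds ratio

beta <- rnorm(n_draws, mu_beta, sigma_beta)
psi  <- rnorm(n_draws, mu_psi, sigma_psi)
p1 <- plogis(beta - psi / 2)
p2 <- plogis(beta + psi / 2)

# Dependency between the two success probabilities
prior_cor <- cor(p1, p2)

# Central 95% prior interval of the absolute risk under the default prior
arisk_range <- quantile(p2 - p1, c(.025, .975))

# Same interval when p1 and p2 are independent and uniform on (0, 1)
unif_range <- quantile(runif(n_draws) - runif(n_draws), c(.025, .975))
```

Comparing `arisk_range` with `unif_range` gives a rough impression of how much narrower the default prior on the absolute risk is than the one implied by independent uniform priors.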

3.4 Encoding prior information

A straightforward way to encode prior information about the model parameters is to set $\mu_{\beta}$, $\sigma_{\beta}$, $\mu_{\psi}$, and $\sigma_{\psi}$ directly. However, it may sometimes be easier to specify prior distributions based on quantities such as the (log) odds ratio, relative risk (i.e., $p_{2}/p_{1}$, the ratio of the success probability in condition B and condition A), and absolute risk (i.e., $p_{2}-p_{1}$, the difference of the success probability in condition B and condition A). The `elicit_prior` function allows users to encode prior information about a quantity of interest (either log odds ratio, odds ratio, relative risk, or absolute risk). The function assumes that the prior on $\beta$ is not the primary target of prior elicitation and is fixed by the user a priori (using the arguments `mu_beta` and `sigma_beta`), for instance to a standard normal prior, which corresponds to a relatively flat implied prior on $p_{1}$ and $p_{2}$ when $\psi=0$.

To encode prior information, the user needs to provide quantiles for a quantity of interest. Let $q_{i},i=1,\ldots,I$ denote the values of $I$ quantiles provided by the user and let $\text{prob}_{i},i=1,\ldots,I$ denote the corresponding probabilities (e.g., for the median, $\text{prob}_{i}=0.5$ ). Least-squares minimization is used to obtain $\mu_{\psi}$ and $\sigma_{\psi}$ as follows:

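$$\underset{\mu_{\psi},\,\sigma_{\psi}}{\arg\min}\;\sum_{i=1}^{I}\left(F(q_{i};\mu_{\psi},\sigma_{\psi})-\text{prob}_{i}\right)^{2},\tag{2}$$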

where $F(\cdot;\mu_{\psi},\sigma_{\psi})$ corresponds to the cumulative distribution function (cdf) for the quantity of interest implied by the normal prior on $\psi$ . For some quantities, this cdf also depends on the prior for $\beta$ ; however, as described above, it is assumed that $\mu_{\beta}$ and $\sigma_{\beta}$ are fixed a priori.
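For the log odds ratio, $F$ is simply the normal cdf (see Appendix B.1), so the least-squares step can be sketched in a few lines of base R. The helper `elicit_logor_sketch` and the quantiles below are hypothetical and serve only as an illustration; this is not the implementation behind `elicit_prior`.

```r
# Least-squares sketch of the elicitation step when the quantity of interest
# is the log odds ratio, so that F is the normal cdf.
# elicit_logor_sketch is a hypothetical helper, not part of abtest.
elicit_logor_sketch <- function(q, prob) {
  loss <- function(par) {
    mu <- par[1]
    sigma <- exp(par[2])   # optimize log(sigma) so that sigma stays positive
    sum((pnorm(q, mu, sigma) - prob)^2)
  }
  start <- c(median(q), log(diff(range(q)) / 4))
  fit <- optim(start, loss, method = "BFGS")
  c(mu_psi = fit$par[1], sigma_psi = exp(fit$par[2]))
}

# Illustrative input: median 0.5 and central 95% interval from 0.1 to 0.9
est <- elicit_logor_sketch(q = c(0.1, 0.5, 0.9), prob = c(.025, .5, .975))
```

Because the three quantiles are consistent with a normal distribution, the fitted values should be close to $\mu_{\psi}=0.5$ and $\sigma_{\psi}=0.4/1.96$. For the other quantities the same loss applies, with `pnorm` replaced by the corresponding implied cdf.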

3.5 Hypothesis testing

To quantify the evidence that the data provide for $\mathcal{H}_{0}$ , $\mathcal{H}_{1}$ , $\mathcal{H}_{+}$ , and $\mathcal{H}_{-}$ , one can compute Bayes factors (Jeffreys, 1939; Kass and Raftery, 1995) and posterior probabilities of the rival hypotheses. The posterior probability of hypothesis $\mathcal{H}_{j}$ , $j\in\{0,1,+,-\}$ is given by:

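$$p(\mathcal{H}_{j}\mid\text{data})=\frac{p(\text{data}\mid\mathcal{H}_{j})\,p(\mathcal{H}_{j})}{\sum_{k\in\{0,1,+,-\}}p(\text{data}\mid\mathcal{H}_{k})\,p(\mathcal{H}_{k})}.$$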

The Bayes factor for comparing hypotheses $\mathcal{H}_{j}$ and $\mathcal{H}_{k}$ equals the change from prior to posterior odds:

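$$\underbrace{\frac{p(\mathcal{H}_{j}\mid\text{data})}{p(\mathcal{H}_{k}\mid\text{data})}}_{\text{posterior odds}}=\underbrace{\frac{p(\text{data}\mid\mathcal{H}_{j})}{p(\text{data}\mid\mathcal{H}_{k})}}_{\text{BF}_{jk}}\times\underbrace{\frac{p(\mathcal{H}_{j})}{p(\mathcal{H}_{k})}}_{\text{prior odds}}.$$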

In order to obtain posterior probabilities of the hypotheses and Bayes factors one needs to evaluate the marginal likelihood $p(\text{data}\mid\mathcal{H}_{j})$ for each hypothesis $j\in\{0,1,+,-\}$ . For $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ , we evaluate the marginal likelihood using Laplace approximations as suggested by Kass and Vaidyanathan (1992). Specifically, the marginal likelihood for $\mathcal{H}_{0}$ is approximated by:

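$$p(\text{data}\mid\mathcal{H}_{0})\approx(2\pi)^{1/2}\,\sigma_{0}\,\exp\left\{l_{0}^{\ast}(\beta_{0}^{\ast})\right\},$$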

where $l_{0}^{\ast}(\beta)=\log\left\{p(\text{data}\mid\beta)\,\pi_{0}(\beta)\right\}$ , $\beta_{0}^{\ast}$ corresponds to the mode of $l_{0}^{\ast}(\beta)$ , and $\sigma_{0}^{2}=\left(-\frac{d^{2}}{d\beta^{2}}\,l_{0}^{\ast}(\beta)\right)^{-1}\bigg{\rvert}_{\beta=\beta_{0}^{\ast}}$ denotes the inverse of the negative second derivative of $l_{0}^{\ast}(\beta)$ evaluated at the mode $\beta_{0}^{\ast}$ .

The marginal likelihood for $\mathcal{H}_{1}$ is approximated by:

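$$p(\text{data}\mid\mathcal{H}_{1})\approx 2\pi\,\left|\bm{\Sigma}_{1}\right|^{1/2}\exp\left\{l^{\ast}(\beta^{\ast},\psi^{\ast})\right\},$$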

where $l^{\ast}(\beta,\psi)=\log\left\{p(\text{data}\mid\beta,\psi)\,\pi(\beta,\psi)\right\}$ , $(\beta^{\ast},\psi^{\ast})$ denotes the mode of $l^{\ast}(\beta,\psi)$ , and $\bm{\Sigma}_{1}=\left(-\bm{H}_{1}\right)^{-1}\big{\rvert}_{(\beta,\psi)=(\beta^{\ast},\psi^{\ast})}$ denotes the inverse of the negative Hessian $\bm{H}_{1}$ (i.e., the matrix with second-order partial derivatives) of $l^{\ast}(\beta,\psi)$ evaluated at the mode $(\beta^{\ast},\psi^{\ast})$ .

These Laplace approximations work well in practice, even for sample sizes that are extremely small. As a demonstration, for a range of synthetic data sets we computed the (log of the) Bayes factor $\text{BF}_{10}$ which compares $\mathcal{H}_{1}$ to $\mathcal{H}_{0}$ using the above Laplace approximations and, as a comparison, also using bridge sampling (Meng and Wong, 1996; Gronau et al., 2020). The priors on $\beta$ and $\psi$ were standard normal distributions. Figure 2 displays the results and confirms that the Laplace approximation yields accurate results, even for sample sizes as small as $n_{1}=n_{2}=5$ .
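The two Laplace approximations can be sketched in base R as follows. The sketch assumes the logit parameterization of Appendix A, standard normal priors on $\beta$ and $\psi$, and fictitious data (20 successes out of 40 in condition A, 30 out of 40 in condition B). It uses numerical Hessians from `optim` rather than the analytic derivatives of Appendix C and is not the package's own code.

```r
# Laplace approximations to the marginal likelihoods under H0 and H1.
# Assumes logit(p1) = beta - psi/2, logit(p2) = beta + psi/2, and standard
# normal priors on beta and psi. The data are fictitious.
y1 <- 20; n1 <- 40; y2 <- 30; n2 <- 40

# l0*(beta): log of likelihood times prior under H0 (psi fixed at zero)
log_joint0 <- function(beta) {
  dbinom(y1, n1, plogis(beta), log = TRUE) +
    dbinom(y2, n2, plogis(beta), log = TRUE) +
    dnorm(beta, 0, 1, log = TRUE)
}
# l*(beta, psi): log of likelihood times prior under H1
log_joint1 <- function(beta, psi) {
  dbinom(y1, n1, plogis(beta - psi / 2), log = TRUE) +
    dbinom(y2, n2, plogis(beta + psi / 2), log = TRUE) +
    dnorm(beta, 0, 1, log = TRUE) + dnorm(psi, 0, 1, log = TRUE)
}

# H0: mode and curvature in one dimension. optim() minimizes -l0*, so the
# returned hessian is the negative second derivative of l0* at the mode.
opt0 <- optim(0, function(b) -log_joint0(b), method = "BFGS", hessian = TRUE)
sigma0 <- sqrt(1 / opt0$hessian[1, 1])
logml0 <- 0.5 * log(2 * pi) + log(sigma0) - opt0$value

# H1: mode and negative Hessian in two dimensions
opt1 <- optim(c(0, 0), function(par) -log_joint1(par[1], par[2]),
              method = "BFGS", hessian = TRUE)
Sigma1 <- solve(opt1$hessian)   # inverse of the negative Hessian
logml1 <- log(2 * pi) + 0.5 * log(det(Sigma1)) - opt1$value

log_bf10 <- logml1 - logml0     # log Bayes factor for H1 over H0
```

The binomial coefficients are included in both marginal likelihoods and therefore cancel in the Bayes factor.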

For the one-sided hypotheses $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$, Laplace approximations did not appear to yield accurate results for small sample sizes, even after removing the constraint on $\psi$ through the parameterization $(\beta,\xi)=(\beta,\log\left(\psi\right))$ for $\mathcal{H}_{+}$ and $(\beta,\xi)=(\beta,\log\left(-\psi\right))$ for $\mathcal{H}_{-}$. The abtest package therefore uses importance sampling to increase the accuracy of the Laplace approximations when computing the marginal likelihoods for $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$. Specifically, a Laplace approximation is used to approximate the mode and covariance matrix of the posterior. The importance density is then given by a multivariate $t$ distribution with location set to the approximated posterior mode, scale matrix set to the approximated posterior covariance matrix, and five degrees of freedom (note that the user can change the degrees of freedom). The marginal likelihood for $\mathcal{H}_{+}$ is then estimated as follows:

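$$\hat{p}(\text{data}\mid\mathcal{H}_{+})=\frac{1}{S}\sum_{s=1}^{S}\frac{p(\text{data}\mid\tilde{\beta}_{s},\tilde{\xi}_{s})\,\pi_{+}(\tilde{\beta}_{s},\tilde{\xi}_{s})}{g_{\text{is}}(\tilde{\beta}_{s},\tilde{\xi}_{s})},$$

where $\left\{\tilde{\beta}_{s},\tilde{\xi}_{s}\right\}_{s=1}^{S}$ denotes $S$ samples from the multivariate $t$ importance density $g_{\text{is}}$, and

$$\pi_{+}(\beta,\xi)=\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\mathcal{N}_{+}(\exp(\xi);\mu_{\psi},\sigma_{\psi}^{2})\,\exp(\xi),$$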

where $\mathcal{N}(x;y,z)$ denotes the probability density function of a normal distribution with mean $y$ and variance $z$ that is evaluated at $x$ . Furthermore, $\mathcal{N}_{+}(x;y,z)$ denotes the density of a normal distribution that is truncated to allow only positive values for $x$ . The marginal likelihood for $\mathcal{H}_{-}$ is computed analogously.
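A minimal base-R sketch of this importance sampling scheme for $\mathcal{H}_{+}$ is given below. It assumes the logit parameterization of Appendix A, a standard normal prior on $\beta$, a standard normal prior on $\psi$ truncated to positive values, and fictitious data. The bivariate $t$ draws and density are coded by hand to keep the example self-contained; the package's own implementation may differ in its details.

```r
# Importance sampling sketch for p(data | H+), with xi = log(psi).
set.seed(1)
y1 <- 20; n1 <- 40; y2 <- 30; n2 <- 40   # fictitious data
df <- 5; S <- 20000

# log{p(data | beta, xi) * prior}, including the Jacobian of psi = exp(xi)
log_target <- function(beta, xi) {
  psi <- exp(xi)
  dbinom(y1, n1, plogis(beta - psi / 2), log = TRUE) +
    dbinom(y2, n2, plogis(beta + psi / 2), log = TRUE) +
    dnorm(beta, 0, 1, log = TRUE) +
    log(2) + dnorm(psi, 0, 1, log = TRUE) +  # N(0, 1) truncated to psi > 0
    xi                                       # log Jacobian
}

# Laplace step: approximate posterior mode and covariance in (beta, xi)
opt <- optim(c(0, 0), function(par) -log_target(par[1], par[2]),
             method = "BFGS", hessian = TRUE)
m <- opt$par
Sigma <- solve(opt$hessian)

# Draws from a bivariate t with location m, scale matrix Sigma, and df
z <- matrix(rnorm(2 * S), ncol = 2) %*% chol(Sigma)
draws <- sweep(z / sqrt(rchisq(S, df) / df), 2, m, "+")

# Log density of that bivariate t at the draws
centered <- sweep(draws, 2, m, "-")
maha <- rowSums((centered %*% solve(Sigma)) * centered)
log_g <- lgamma((df + 2) / 2) - lgamma(df / 2) - log(df * pi) -
  0.5 * log(det(Sigma)) - (df + 2) / 2 * log1p(maha / df)

# Average the importance ratios on the log scale for numerical stability
log_w <- log_target(draws[, 1], draws[, 2]) - log_g
logml_plus <- max(log_w) + log(mean(exp(log_w - max(log_w))))
```

The heavy tails of the $t$ density are the reason for choosing it as importance density: they keep the importance ratios bounded where the posterior of $\xi$ is skewed.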

3.6 Obtaining posterior samples

In a Bayesian A/B test application, one may not only be interested in testing hypotheses, but also in obtaining posterior samples for the model parameters under $\mathcal{H}_{1}$, $\mathcal{H}_{+}$, and $\mathcal{H}_{-}$. The abtest package allows the user to obtain posterior samples using sampling importance resampling (e.g., Robert and Casella, 2010). Specifically, posterior samples for $\mathcal{H}_{+}$ are obtained as follows (samples for the other hypotheses are obtained in an analogous manner):

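1. Generate $S$ samples from the multivariate $t$ proposal distribution mentioned before, denoted by $\left\{\tilde{\beta}_{s},\tilde{\xi}_{s}\right\}_{s=1}^{S}$.

2. Compute the importance weights:

$$w_{s}=\frac{p(\text{data}\mid\tilde{\beta}_{s},\tilde{\xi}_{s})\,\pi_{+}(\tilde{\beta}_{s},\tilde{\xi}_{s})}{g_{\text{is}}(\tilde{\beta}_{s},\tilde{\xi}_{s})},\quad s=1,2,\ldots,S,$$

where $\pi_{+}$ denotes the prior density under $\mathcal{H}_{+}$ in the $(\beta,\xi)$ parameterization and $g_{\text{is}}$ denotes the importance density.

3. Renormalize the importance weights: $v_{s}=w_{s}/\sum_{t=1}^{S}w_{t}$, $s=1,2,\ldots,S$.

4. Resample (with replacement) from the samples obtained from the importance density according to the normalized importance weights $v_{s}$, which yields (approximate) samples from the posterior distribution.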
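The four steps can be sketched in base R. For brevity, the sketch targets the unconstrained hypothesis $\mathcal{H}_{1}$, so that the proposal lives on $(\beta,\psi)$ and no transformation is needed; it assumes the logit parameterization of Appendix A, standard normal priors, and fictitious data. It is an illustration of the technique rather than the package's implementation.

```r
# Sampling importance resampling (SIR) sketch for the posterior under H1.
set.seed(1)
y1 <- 20; n1 <- 40; y2 <- 30; n2 <- 40   # fictitious data
df <- 5; S <- 20000

log_target <- function(beta, psi) {      # log{likelihood * prior}
  dbinom(y1, n1, plogis(beta - psi / 2), log = TRUE) +
    dbinom(y2, n2, plogis(beta + psi / 2), log = TRUE) +
    dnorm(beta, 0, 1, log = TRUE) + dnorm(psi, 0, 1, log = TRUE)
}
opt <- optim(c(0, 0), function(par) -log_target(par[1], par[2]),
             method = "BFGS", hessian = TRUE)
m <- opt$par
Sigma <- solve(opt$hessian)

# Step 1: draws from the bivariate t proposal
z <- matrix(rnorm(2 * S), ncol = 2) %*% chol(Sigma)
draws <- sweep(z / sqrt(rchisq(S, df) / df), 2, m, "+")

# Step 2: log importance weights; the normalizing constant of the t density
# is omitted because it cancels in step 3
centered <- sweep(draws, 2, m, "-")
maha <- rowSums((centered %*% solve(Sigma)) * centered)
log_w <- log_target(draws[, 1], draws[, 2]) + (df + 2) / 2 * log1p(maha / df)

# Step 3: normalized importance weights
v <- exp(log_w - max(log_w))
v <- v / sum(v)

# Step 4: resample with replacement according to the normalized weights
idx <- sample.int(S, size = S, replace = TRUE, prob = v)
post <- draws[idx, , drop = FALSE]
colnames(post) <- c("beta", "psi")
psi_summary <- quantile(post[, "psi"], c(.025, .5, .975))
```

Posterior summaries for other quantities follow by transforming the rows of `post`, for example `plogis(post[, "beta"] + post[, "psi"] / 2)` for $p_{2}$.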

4 Example 1: effectiveness of resilience training (continued)

Next we continue the effectiveness of resilience training example and show how expert prior information can be taken into account, how the hypotheses of interest can be tested, and how one can estimate the model parameters using the abtest package.

4.1 Prior specification

Before commencing the A/B test, the managers asked the coaching company to specify how effective they believe the training will be. The coaching company claimed that, based on past experience with the training, they expect the proportion of employees who do not quit within the first six months to be 15% larger for the group who received the training, with a 95% uncertainty interval ranging from a 2.5% benefit to a 27.5% benefit. Assuming that the claimed 15% corresponds to the prior median, this expectation corresponds to a median absolute risk (i.e., $p_{2}-p_{1}$) of $0.15$ with a 95% uncertainty interval ranging from $0.025$ to $0.275$. The `elicit_prior` function can be used to encode this prior information (all code and plots are also available at https://osf.io/t3ajr/):

```r
R> library("abtest")
R> prior_par <- elicit_prior(q = c(0.025, 0.15, 0.275),
+                            prob = c(.025, .5, .975),
+                            what = "arisk")
```

The obtained prior on the absolute risk can be visualized as follows:

```r
R> plot_prior(prior_par, what = "arisk")
```

The resulting graph is shown in the top panel of Figure 3.

The user can also visualize the (implied) prior for other quantities. For instance, the prior on the log odds ratio (middle panel of Figure 3) is obtained as follows:

```r
R> plot_prior(prior_par, what = "logor")
```

The implied prior on the success probabilities $p_{1}$ and $p_{2}$ (bottom panel of Figure 3) is obtained as follows:

```r
R> plot_prior(prior_par, what = "p1p2")
```

The bottom panel of Figure 3 illustrates that there is a dependency between $p_{1}$ and $p_{2}$, which is arguably desirable (Howard, 1998): when one of the success probabilities is very small (large), it is likely that the other one will also be small (large).

4.2 Hypothesis testing

Since the number of employees still on the job after six months is $249$ in the group without training and $269$ in the trained group, the observed success probabilities are $\hat{p}_{1}=.498$ in the control group and $\hat{p}_{2}=.538$ in the group that received training. Consequently, the observed success probabilities suggest that there is a positive effect of the training of 4%; however, a statistical analysis is required to assess whether this observed difference is statistically compelling. The `ab_test` function can be used to conduct a Bayesian A/B test as follows:

```r
R> data("seqdata")
R> set.seed(1)
R> ab <- ab_test(data = seqdata, prior_par = prior_par)
```

This yields the following output:

```r
R> print(ab)
Bayesian A/B Test Results:

 Bayes Factors:

 BF10: 0.1406443
 BF+0: 0.13823
 BF-0: 0.4920187

 Prior Probabilities Hypotheses:

 H+: 0.25
 H-: 0.25
 H0: 0.5

 Posterior Probabilities Hypotheses:

 H+: 0.0526
 H-: 0.1871
 H0: 0.7604
```

The first part of the output presents Bayes factors in favor of the hypotheses $\mathcal{H}_{1}$, $\mathcal{H}_{+}$, and $\mathcal{H}_{-}$, where the reference hypothesis (i.e., denominator of the Bayes factor) is $\mathcal{H}_{0}$. Since all three Bayes factors are smaller than 1, they all indicate evidence in favor of the null hypothesis of no effect. The next part of the output displays the prior probabilities of the hypotheses with non-zero prior probability. As explained before, the default setting assigns probability $.50$ to the null hypothesis and splits the remaining probability evenly across $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$. The user can change this default setting via the `prior_prob` argument (e.g., to assign non-zero probability to $\mathcal{H}_{1}$). The final part of the output displays the posterior probabilities of the hypotheses with non-zero prior probability. The posterior probability of the null hypothesis $\mathcal{H}_{0}$ indicates that the data have increased the plausibility of the null hypothesis from $.50$ to $.76$. Furthermore, the data have decreased the plausibility of both $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$.
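The link between the three parts of the output can be checked by hand: each posterior probability is proportional to the Bayes factor against $\mathcal{H}_{0}$ multiplied by the prior probability. The short sketch below redoes this arithmetic with the reported numbers; the variable names are illustrative.

```r
# Recompute the posterior probabilities reported above from the Bayes factors
# (relative to H0) and the default prior probabilities.
bf_vs_h0   <- c("H+" = 0.13823, "H-" = 0.4920187, "H0" = 1)
prior_prob <- c("H+" = 0.25,    "H-" = 0.25,      "H0" = 0.5)

post_prob <- bf_vs_h0 * prior_prob / sum(bf_vs_h0 * prior_prob)
round(post_prob, 4)
# H+: 0.0526, H-: 0.1871, H0: 0.7604
```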

As an aside, it may appear paradoxical that the data indicate a 4% positive effect of the training and yet the posterior probability of $\mathcal{H}_{-}$ is larger than that of $\mathcal{H}_{+}$ . The reason for this result is that the company’s prior was overly ambitious, and $\mathcal{H}_{+}$ is penalized for having predicted effects that are much too large. Furthermore, note that the test-relevant prior distribution under $\mathcal{H}_{-}$ is obtained by truncating the prior on $\psi$ at zero and renormalizing. Since the company’s prior assigns almost all mass to positive log odds ratio values, renormalizing the negative part of the distribution results in a prior that is highly similar to $\mathcal{H}_{0}$ ; this explains why $\mathcal{H}_{-}$ receives non-trivial posterior probability. These considerations underscore the fact that the outcome of a Bayesian analysis is always relative to the specific set of models (and associated prior distributions) under consideration. Because highly informed priors can exert a large influence on the results, it is generally wise to examine the robustness of the conclusions by executing the default analysis as well. This analysis is reported in Appendix D.

The abtest package allows users to visualize the posterior probabilities of the hypotheses by means of a probability wheel (Figure 4):

```r
R> prob_wheel(ab)
```

Overall, the data support the hypothesis that the training is ineffective over the company’s hypothesis that the training is highly effective. The Bayes factor for $\mathcal{H}_{0}$ over $\mathcal{H}_{+}$ equals $1/0.138\approx 7.2$ , which indicates moderate evidence (Jeffreys, 1939, Appendix I).

Since the data set is of a sequential nature, it may be of interest to consider not only the result based on all observations, but also to conduct a sequential analysis that tracks the evidential flow as a function of the total number of observations (i.e., the number of observations across both groups). This sequential analysis can be conducted as follows:

```r
R> plot_sequential(ab, thin = 4)
```

Setting the `thin` argument to `4` indicates that the evidence is computed after every 4th observation. Thinning can be useful to speed up the analysis in case the data set is very large or in case observations arrive in batches. Figure 1 displays the result of the sequential analysis. The posterior probability of each hypothesis with non-zero prior probability is plotted as a function of the total number of observations. At the top, two probability wheels visualize the prior probabilities of the hypotheses and the posterior probabilities of the hypotheses based on all available data. Figure 1 shows that after some initial fluctuation, adding more observations increased the probability of the null hypothesis that there is no effect of the training.

4.3 Parameter estimation

The data indicate evidence in favor of the null hypothesis versus the hypothesis that the training is highly effective, leaving open the possibility that the training does have an effect, but of a more modest size than the company anticipated. To assess this possibility one may investigate the potential size of the effect under the assumption that the effect is non-zero. (For consistency, we continue this analysis with the company’s prior; an analysis with the less enthusiastic default prior is provided in Appendix D.) For parameter estimation, we generally prefer to investigate the posterior distribution for the unconstrained alternative hypothesis $\mathcal{H}_{1}$; however, the abtest package also provides posterior samples and plotting functionality for the constrained hypotheses $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$.

The top panel of Figure 5 displays the posterior distribution for the absolute risk (i.e., $p_{2}-p_{1}$) that can be obtained as follows:

```r
R> plot_posterior(ab, what = "arisk")
```

The top panel of Figure 5 shows the prior distribution as a dotted line and the posterior distribution (with 95% central credible interval) as a solid line. The plot indicates that, under the assumption that the difference between the two success probabilities is not exactly zero, it is likely to be smaller than expected: the posterior median is $0.067$ and the 95% central credible interval ranges from $0.011$ to $0.122$ .

The middle panel of Figure 5 displays the posterior distribution for the log odds ratio $\psi$ that can be obtained as follows:

```r
R> plot_posterior(ab, what = "logor")
```

The middle panel of Figure 5 indicates that, given the log odds ratio is not exactly zero, it is likely to be between $0.043$ and $0.492$ , where the posterior median is $0.267$ .

It may also be of interest to consider the marginal posterior distributions of the success probabilities $p_{1}$ and $p_{2}$. This plot can be produced as follows:

```r
R> plot_posterior(ab, what = "p1p2")
```

The bottom panel of Figure 5 displays the resulting plot. In this example, $p_{1}$ and $p_{2}$ correspond to the probability of still being on the job after six months for the non-trained employees and the employees who received the training, respectively. The bottom panel of Figure 5 indicates that the posterior median for $p_{1}$ is $0.485$, with 95% credible interval ranging from $0.443$ to $0.527$, and the posterior median for $p_{2}$ is $0.551$, with 95% credible interval ranging from $0.509$ to $0.592$.

In sum, this fictitious data set offers modest evidence in favor of the null hypothesis which states that the training is not effective over the hypothesis that the training is highly effective; nevertheless, the consultancy firm should probably continue to collect data in order to obtain more compelling evidence before deciding whether or not the training should be implemented. If the true effect is as small as 4%, continued testing will ultimately show compelling evidence for $\mathcal{H}_{+}$ over $\mathcal{H}_{0}$ . Note that continued testing is trivial in the Bayesian framework: the results can simply be updated as new observations arrive.

5 Example 2: progesterone in women with bleeding in early pregnancy

As a second example application of the abtest package, here we present a reanalysis of a recent medical trial. (This reanalysis is also available on PsyArXiv: Gronau, Q. F., & Wagenmakers, E.-J. (2019). Progesterone in women with bleeding in early pregnancy: Absence of evidence, not evidence of absence. https://psyarxiv.com/etk7g/) Coomarasamy et al. (2019) assessed the effectiveness of progesterone in preventing miscarriages. The incidence of live births was 74.7% (1513/2025) in the progesterone group and 72.5% (1459/2013) in the placebo group ($p=.08$). The authors concluded: “The incidence of adverse events did not differ significantly between the groups” (Coomarasamy et al., 2019, p. 1815).

This conclusion leaves unaddressed the degree to which the data undercut or support the no-effect hypothesis $\mathcal{H}_{0}$ over the positive-effect hypothesis $\mathcal{H}_{+}$. To quantify such evidence we can use the abtest package. A default analysis can be conducted as follows:

```r
R> data <- list(y1 = 1459, n1 = 2013, y2 = 1513, n2 = 2025)
R> set.seed(1)
R> ab <- ab_test(data = data)
```

This yields the following output:

```r
R> print(ab)
Bayesian A/B Test Results:

 Bayes Factors:

 BF10: 0.259709
 BF+0: 0.4866008
 BF-0: 0.02796485

 Prior Probabilities Hypotheses:

 H+: 0.25
 H-: 0.25
 H0: 0.5

 Posterior Probabilities Hypotheses:

 H+: 0.1935
 H-: 0.0111
 H0: 0.7954
```
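As a check on this output, the posterior probability of the null hypothesis and the Bayes factor in favor of $\mathcal{H}_{0}$ over $\mathcal{H}_{+}$ can be recomputed by hand from the reported Bayes factors and the default prior probabilities:

$$p(\mathcal{H}_{0}\mid\text{data})=\frac{.50}{.50+.25\times 0.4866008+.25\times 0.02796485}=\frac{.50}{0.6286414}\approx .7954,\qquad\text{BF}_{0+}=\frac{1}{\text{BF}_{+0}}=\frac{1}{0.4866008}\approx 2.06.$$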

A Bayes factor of $\text{BF}_{0+}=1/\text{BF}_{+0}\approx 2$ indicates that there is only weak evidence in favor of the no-effect hypothesis $\mathcal{H}_{0}$ over the positive-effect hypothesis $\mathcal{H}_{+}$ (Jeffreys, 1939). To alleviate concerns about the choice of the prior distribution for the test-relevant log odds ratio parameter $\psi$ one can conduct a prior robustness analysis as follows:

```r
R> plot_robustness(ab, bftype = "BF0+")
```

Note that the `bftype` argument is used to indicate which Bayes factor is plotted (in this case $\text{BF}_{0+}$). Figure 6 displays the results and shows that the evidence is weak for all combinations of $\mu_{\psi}\in[0,0.30]$ and $\sigma_{\psi}\in[0.25,1]$.

In sum, these data neither undercut nor support the progesterone hypothesis in compelling fashion.

6 Concluding comments

In this article, we have introduced the abtest package that implements both Bayesian hypothesis testing and Bayesian estimation for the A/B test using informed priors. The procedure allows users to (1) obtain evidence in favor of the null hypothesis; (2) monitor the evidence as data accumulate; and (3) elicit and incorporate expert prior distributions. We hope that the provided analysis approach is useful across different fields that apply A/B testing on a routine basis, particularly business and medicine.

We have introduced the approach implemented in abtest as testing hypotheses of interest about the test-relevant log odds ratio parameter $\psi$ for the model in Equation 1. However, it should be pointed out that an alternative interpretation is to view the procedure as estimating a mixture model, where the mixture components correspond to the different hypotheses of interest, and the mixture weights are given by the prior/posterior probabilities of the hypotheses (e.g., Mitchell and Beauchamp, 1988). This interpretation is illustrated with a fictitious example in Figure 7. For simplicity, the plot assumes that the user has set the prior probabilities of $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$ to zero, whereas the prior probabilities of $\mathcal{H}_{1}$ and $\mathcal{H}_{0}$ are both set to .50. The left panel illustrates the mixture representation before having observed any data. Specifically, the height of the spike at zero corresponds to the prior probability of $\mathcal{H}_{0}$ whereas the shape of the slab corresponds to the continuous default prior distribution for $\psi$ under $\mathcal{H}_{1}$. The maximum height of this continuous distribution corresponds to the prior probability of $\mathcal{H}_{1}$. (This scaling method is inspired by the BAS package; Clyde, 2020.) The right panel illustrates the mixture representation after having observed 20 successes out of 40 observations in the control condition and 30 successes out of 40 observations in the experimental condition (these are fictitious data). The height of the spike corresponds to the posterior probability of $\mathcal{H}_{0}$, and the maximum height of the continuous posterior distribution under $\mathcal{H}_{1}$ (i.e., the slab) corresponds to the posterior probability of $\mathcal{H}_{1}$. In this fictitious example, the data have decreased the plausibility of $\mathcal{H}_{0}$ and have increased the plausibility of $\mathcal{H}_{1}$.
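In symbols, and writing $\delta_{0}$ for a point mass at zero (a notation introduced here only for illustration), the spike-and-slab posterior for this two-hypothesis setting can be written as

$$p(\psi\mid\text{data})=p(\mathcal{H}_{0}\mid\text{data})\,\delta_{0}(\psi)+p(\mathcal{H}_{1}\mid\text{data})\,p(\psi\mid\text{data},\mathcal{H}_{1}),$$

with the prior mixture obtained by replacing the posterior probabilities and the posterior density of $\psi$ by their prior counterparts.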

Despite the practical benefits that the package offers right now, there are areas for future improvement. For instance, abtest currently allows users to compare two groups; however, there are applications in which one may be interested in simultaneously comparing more than two groups. Furthermore, at the moment, abtest expects the outcome variable to be binary. Nevertheless, in certain scenarios, it may be more natural to compare the two groups based on a continuous outcome variable. This scenario resembles an independent samples $t$-test for which well-established Bayesian procedures exist (e.g., Rouder et al., 2009; Ly et al., 2016), which are available, for instance, in the BayesFactor package (Morey and Rouder, 2018) and JASP (JASP Team, 2020). (For a list of Bayesian R packages, see https://cran.r-project.org/web/views/Bayesian.html.) Moreover, currently, the abtest package does not provide functions for generating predictions. Note, however, that users can generate predictions in a straightforward manner themselves based on the posterior samples that are provided by abtest. The implementation also does not allow users to incorporate utilities explicitly (e.g., Lindley, 1985; for alternative approaches see also Azevedo et al., 2019 and Feit and Berman, 2019). However, again, based on the provided posterior probabilities and posterior samples, users who wish to take into account utilities may do so in a relatively straightforward way. Furthermore, users interested in adjusting the model used in abtest (e.g., to account for hierarchically-structured data or covariates) are referred to general-purpose Bayesian software such as Stan (Carpenter et al., 2017; Stan Development Team, 2019) and the related R package brms (Bürkner, 2017). In combination with the bridgesampling package (Gronau et al., 2020), this enables the user to compare custom models using Bayes factors and posterior model probabilities. A more structural limitation of abtest is that it has been developed to analyze A/B test data, but not to run the A/B test experiment itself.
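How predictions and a simple utility calculation might be derived from posterior samples is sketched below. The vectors `p1_draws` and `p2_draws` are stand-ins for posterior samples of the success probabilities; here they are generated from Beta distributions purely for illustration and are not output of abtest. The payoff structure is likewise hypothetical.

```r
# Posterior predictive sketch based on samples of the success probabilities.
# p1_draws and p2_draws are illustrative stand-ins, not abtest output.
set.seed(1)
p1_draws <- rbeta(5000, 21, 21)
p2_draws <- rbeta(5000, 31, 11)

n_new <- 100   # hypothetical number of future observations per group
y1_new <- rbinom(length(p1_draws), n_new, p1_draws)
y2_new <- rbinom(length(p2_draws), n_new, p2_draws)

# Predictive probability that condition B yields more successes than A
prob_b_better <- mean(y2_new > y1_new)

# Expected gain under a hypothetical payoff: each additional success is worth
# one unit and switching to condition B costs five units in total
expected_gain <- mean(y2_new - y1_new) - 5
```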

In sum, A/B testing is ubiquitous in business and medicine. Here we have demonstrated how the abtest package enables relatively complete Bayesian inference including the capability to obtain support for the null, continuously monitor the results, and elicit and incorporate expert prior knowledge. Hopefully, this approach forms a basis for evidence-based conclusions that will benefit both businesses and patients.

7 Acknowledgements

This research was supported by a Netherlands Organisation for Scientific Research (NWO) grant to QFG (406.16.528) and by an NWO Vici grant to EJW (016.Vici.170.083).

Appendix A Interpretation of the parameters

Here we show that $\beta$ corresponds to the grand mean of the log odds and that $\psi$ corresponds to the log odds ratio (for the model definition, see Equation 1). The nuisance parameter $\beta$ corresponds to the grand mean of the log odds since

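$$\frac{1}{2}\left[\log\left(\frac{p_{1}}{1-p_{1}}\right)+\log\left(\frac{p_{2}}{1-p_{2}}\right)\right]=\frac{1}{2}\left[\left(\beta-\frac{\psi}{2}\right)+\left(\beta+\frac{\psi}{2}\right)\right]=\beta.$$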

The test-relevant parameter $\psi$ corresponds to the log odds ratio since

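$$\log\left(\frac{p_{2}/(1-p_{2})}{p_{1}/(1-p_{1})}\right)=\log\left(\frac{p_{2}}{1-p_{2}}\right)-\log\left(\frac{p_{1}}{1-p_{1}}\right)=\left(\beta+\frac{\psi}{2}\right)-\left(\beta-\frac{\psi}{2}\right)=\psi.$$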

Appendix B Prior elicitation: implied distributions

The prior elicitation approach described in Equation 2 requires the cdf’s for the quantities of interest. Here, we derive the implied cdf’s for these quantities; we also derive the corresponding probability density functions (pdf’s). Additionally, we derive four further implied distributions of interest: the joint pdf of $p_{1}$ and $p_{2}$ , the conditional pdf of $p_{2}$ given $p_{1}$ is fixed to a particular value, the marginal distribution for $p_{1}$ , and the marginal distribution for $p_{2}$ . A few of these expressions will contain a one-dimensional integral which can easily be evaluated using numerical integration.

B.1 Log odds ratio

Since $\psi$ itself corresponds to the log odds ratio, $F(\cdot;\mu_{\psi},\sigma_{\psi})$ corresponds in this case to the cdf of a normal distribution with mean $\mu_{\psi}$ and standard deviation $\sigma_{\psi}$ . The corresponding pdf is the normal probability density function.

B.2 Odds ratio

The implied prior on the odds ratio $\omega=\exp(\psi)$ is a log-normal distribution. Hence, $F(\cdot;\mu_{\psi},\sigma_{\psi})$ corresponds in this case to the cdf of a log-normal distribution with parameters $\mu_{\psi}$ and $\sigma_{\psi}$ . The corresponding pdf is the log-normal probability density function.

B.3 Relative risk

The relative risk is given by $\Lambda=\frac{p_{2}}{p_{1}}$ . We use a capital letter (i.e., $\Lambda$ ) to refer to the random variable and use a lower-case letter (i.e., $\lambda$ ) to refer to a concrete realization. Note that so far, we have abused notation by only using lower-case letters, but it should be clear from the context when we referred to a random variable or a concrete realization. However, for deriving the following cdf, we need the distinction to keep the notation clear. To derive the implied cdf for the relative risk, we proceed as follows:

[TABLE]

Taking reciprocals and some algebra yields

[TABLE]

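When we set

$$\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}+\left(1-\lambda\right)\exp(\beta)\exp\left(\frac{\psi}{2}\right)-\lambda=0,$$

we can solve for $\psi$ using the fact that this is a quadratic equation in $\exp\left(\frac{\psi}{2}\right)$ and we obtain:

$$\exp\left(\frac{\psi}{2}\right)=\frac{\left(\lambda-1\right)\exp(\beta)+\sqrt{\left(\lambda-1\right)^{2}\exp(2\beta)+4\lambda}}{2},$$

where we took into account that $\exp\left(\frac{\psi}{2}\right)$ needs to be positive (i.e., we omitted the solution corresponding to minus the square root). Hence,

$$\psi=2\log\left(\frac{\left(\lambda-1\right)\exp(\beta)+\sqrt{\left(\lambda-1\right)^{2}\exp(2\beta)+4\lambda}}{2}\right).$$

Therefore, $\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}+\left(1-\lambda\right)\exp(\beta)\exp\left(\frac{\psi}{2}\right)-\lambda\leq 0$ whenever

$$\psi\leq 2\log\left(\frac{\left(\lambda-1\right)\exp(\beta)+\sqrt{\left(\lambda-1\right)^{2}\exp(2\beta)+4\lambda}}{2}\right).$$

Hence, the desired cdf can be written as

$$F(\lambda;\mu_{\psi},\sigma_{\psi})=\int_{-\infty}^{\infty}\Phi\left(2\log\left(\frac{\left(\lambda-1\right)\exp(\beta)+\sqrt{\left(\lambda-1\right)^{2}\exp(2\beta)+4\lambda}}{2}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\text{d}\beta,$$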

where $\Phi\left(\cdot;\mu_{\psi},\sigma_{\psi}^{2}\right)$ denotes the cdf of a normal distribution with mean $\mu_{\psi}$ and variance $\sigma_{\psi}^{2}$ , and $\mathcal{N}(\cdot;\mu_{\beta},\sigma_{\beta}^{2})$ denotes the corresponding pdf.

The pdf of the relative risk is obtained by taking the derivative with respect to $\lambda$ :

[TABLE]

B.4 Absolute risk

The absolute risk is given by $\Upsilon=p_{2}-p_{1}$ . We use the upper-case letter $\Upsilon$ to refer to the random variable and the lower-case letter $\upsilon$ to refer to a concrete realization. To derive the implied cdf for the absolute risk, we proceed as follows:

[TABLE]

After some algebra, we obtain

[TABLE]

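When we set

$$\exp\left(\beta\right)\left(1-\upsilon\right)\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}-\upsilon\left(\exp\left(2\beta\right)+1\right)\exp\left(\frac{\psi}{2}\right)-\exp\left(\beta\right)\left(\upsilon+1\right)=0,$$

we can solve for $\psi$ using the fact that this is a quadratic equation in $\exp\left(\frac{\psi}{2}\right)$ and we obtain:

$$\exp\left(\frac{\psi}{2}\right)=\frac{\upsilon\left(\exp\left(2\beta\right)+1\right)+\sqrt{\upsilon^{2}\left(\exp\left(2\beta\right)+1\right)^{2}+4\exp\left(2\beta\right)\left(1-\upsilon^{2}\right)}}{2\exp\left(\beta\right)\left(1-\upsilon\right)},$$

where we took into account that $\exp\left(\frac{\psi}{2}\right)$ needs to be positive (i.e., we omitted the solution corresponding to minus the square root). Hence,

$$\psi=2\log\left(\frac{\upsilon\left(\exp\left(2\beta\right)+1\right)+\sqrt{\upsilon^{2}\left(\exp\left(2\beta\right)+1\right)^{2}+4\exp\left(2\beta\right)\left(1-\upsilon^{2}\right)}}{2\exp\left(\beta\right)\left(1-\upsilon\right)}\right).$$

Therefore, $\exp\left(\beta\right)\left(1-\upsilon\right)\left(\exp\left(\frac{\psi}{2}\right)\right)^{2}-\upsilon\left(\exp\left(2\beta\right)+1\right)\exp\left(\frac{\psi}{2}\right)-\exp\left(\beta\right)\left(\upsilon+1\right)\leq 0$ whenever

$$\psi\leq 2\log\left(\frac{\upsilon\left(\exp\left(2\beta\right)+1\right)+\sqrt{\upsilon^{2}\left(\exp\left(2\beta\right)+1\right)^{2}+4\exp\left(2\beta\right)\left(1-\upsilon^{2}\right)}}{2\exp\left(\beta\right)\left(1-\upsilon\right)}\right).$$

Hence, the desired cdf can be written as

$$F(\upsilon;\mu_{\psi},\sigma_{\psi})=\int_{-\infty}^{\infty}\Phi\left(2\log\left(\frac{\upsilon\left(\exp\left(2\beta\right)+1\right)+\sqrt{\upsilon^{2}\left(\exp\left(2\beta\right)+1\right)^{2}+4\exp\left(2\beta\right)\left(1-\upsilon^{2}\right)}}{2\exp\left(\beta\right)\left(1-\upsilon\right)}\right);\mu_{\psi},\sigma_{\psi}^{2}\right)\mathcal{N}(\beta;\mu_{\beta},\sigma_{\beta}^{2})\,\text{d}\beta.$$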
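The cdf of the absolute risk can be checked numerically. Solving the quadratic inequality above for its positive root in $\exp\left(\frac{\psi}{2}\right)$ gives an upper bound on $\psi$ for each value of $\beta$; the cdf is then the normal cdf of $\psi$ at that bound, averaged over the prior on $\beta$. The sketch below does this with `integrate` and compares the result with a direct Monte Carlo estimate. The helper `parisk_sketch` is hypothetical and not a function from abtest, and the integration range of eight prior standard deviations is a pragmatic choice.

```r
# Implied prior cdf of the absolute risk p2 - p1 via one-dimensional
# numerical integration over beta. parisk_sketch is a hypothetical helper.
parisk_sketch <- function(u, mu_psi = 0, sigma_psi = 1,
                          mu_beta = 0, sigma_beta = 1) {
  integrand <- function(beta) {
    b <- exp(beta)
    # positive root of the quadratic in exp(psi / 2)
    root <- (u * (b^2 + 1) + sqrt(u^2 * (b^2 + 1)^2 + 4 * b^2 * (1 - u^2))) /
      (2 * b * (1 - u))
    pnorm(2 * log(root), mu_psi, sigma_psi) * dnorm(beta, mu_beta, sigma_beta)
  }
  integrate(integrand, mu_beta - 8 * sigma_beta, mu_beta + 8 * sigma_beta)$value
}

# Monte Carlo comparison under the default standard normal priors
set.seed(1)
beta <- rnorm(1e5)
psi  <- rnorm(1e5)
arisk <- plogis(beta + psi / 2) - plogis(beta - psi / 2)

cdf_quad <- parisk_sketch(0.15)    # numerical integration
cdf_mc   <- mean(arisk <= 0.15)    # simulation
```

With a zero-centered prior on $\psi$, `parisk_sketch(0)` should return one half, which provides a quick sanity check.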

The pdf of the absolute risk is obtained by taking the derivative with respect to $\upsilon$ :

[TABLE]

B.5 Joint distribution of $p_{1}$ and $p_{2}$

Another distribution of interest is the implied joint distribution of the two success probabilities $p_{1}$ and $p_{2}$ . This distribution will not be used to elicit the prior on $\psi$ which is the reason why we only derive the pdf and not the cdf. The model parameters $\beta$ and $\psi$ are related to $p_{1}$ and $p_{2}$ as follows:

[TABLE]

Hence, the inverse transformation is given by:

[TABLE]

The corresponding Jacobian is:

[TABLE]

Therefore, the joint pdf of $p_{1}$ and $p_{2}$ is given by:

[TABLE]

B.6 Marginal distribution of $p_{1}$

The marginal distribution of $p_{1}$ is given by:

[TABLE]

B.7 Marginal distribution of $p_{2}$

The marginal distribution of $p_{2}$ is given by:

[TABLE]

B.8 Conditional distribution of $p_{2}$ given $p_{1}$

Another distribution of interest is the conditional distribution of the second success probability $p_{2}$ given a particular value of $p_{1}$ . This distribution will not be used for prior elicitation which is the reason why we only present the expression for the pdf which is given by:

[TABLE]

B.9 Implied distributions for truncated priors on the log odds ratio

Note that the above expressions can all be easily modified in case the prior on the log odds ratio $\psi$ is a truncated normal distribution (e.g., restricting $\psi$ to be larger or smaller than zero), which is the case for the hypotheses $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$ . In this case, the normal prior density function and cumulative distribution function for $\psi$ simply need to be replaced by the truncated versions. For the implied log-normal prior on the odds ratio, the truncation bounds for $\psi$ simply need to be exponentiated to obtain the truncation bounds for the log-normal prior.
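The truncated versions can be sketched in base R as follows. This is an illustration under the assumption of a normal prior on $\psi$ with hypothetical arguments `mu`, `sigma`, `lower`, and `upper`; the function names are not taken from the abtest package.

```r
# Prior mass of the untruncated normal inside the truncation bounds.
trunc_mass <- function(mu, sigma, lower, upper) {
  pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma)
}

# Truncated normal density for the log odds ratio psi.
dnorm_trunc <- function(x, mu = 0, sigma = 1, lower = -Inf, upper = Inf) {
  inside <- x >= lower & x <= upper
  ifelse(inside, dnorm(x, mu, sigma) / trunc_mass(mu, sigma, lower, upper), 0)
}

# Truncated normal cdf for psi.
pnorm_trunc <- function(q, mu = 0, sigma = 1, lower = -Inf, upper = Inf) {
  q <- pmin(pmax(q, lower), upper)
  (pnorm(q, mu, sigma) - pnorm(lower, mu, sigma)) /
    trunc_mass(mu, sigma, lower, upper)
}

# Truncated log-normal density for the odds ratio exp(psi):
# the bounds on psi are exponentiated to obtain the bounds on the odds ratio.
dlnorm_trunc <- function(x, mu = 0, sigma = 1, lower = -Inf, upper = Inf) {
  inside <- x >= exp(lower) & x <= exp(upper)
  ifelse(inside, dlnorm(x, mu, sigma) / trunc_mass(mu, sigma, lower, upper), 0)
}

# Example: prior under H+ (psi > 0) with a standard normal parent distribution.
mass_psi <- integrate(function(x) dnorm_trunc(x, lower = 0), 0, Inf)$value
mass_or <- integrate(function(x) dlnorm_trunc(x, lower = 0), 1, Inf)$value
```

For $\mathcal{H}_{+}$ one would set `lower = 0`, and for $\mathcal{H}_{-}$ one would set `upper = 0`; on the odds ratio scale both bounds correspond to an odds ratio of 1.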

Appendix C Laplace approximation details

The Laplace approximations require first-order and second-order derivatives. Let us first state explicitly the functions for which we need to find the derivatives. For $\mathcal{H}_{0}$ we have:

[TABLE]

For $\mathcal{H}_{1}$ we have:

[TABLE]

For $\mathcal{H}_{+}$ we have:

[TABLE]

Finally, for $\mathcal{H}_{-}$ we have

[TABLE]

C.1 First-order derivatives

The first-order derivatives are used to find the modes for the Laplace approximations. As shown below, we can find these derivatives analytically; however, setting the derivatives equal to zero and solving for the parameters is not straightforward. Nevertheless, having these derivatives is useful not only as an intermediate step to finding the second-order derivatives but also for finding the modes: This allows us to provide numerical optimizers with the analytic expressions for the derivatives which can increase speed and accuracy for numerically finding the modes of the relevant functions.
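To illustrate how analytic derivatives can be handed to a numerical optimizer, here is a minimal R sketch for the unconstrained alternative. It assumes that the function to be maximized is the unnormalized log posterior (binomial log likelihood plus independent normal log priors on $\beta$ and $\psi$ , with binomial coefficients dropped) under the logit parameterization $\text{logit}(p_{1})=\beta-\psi/2$ , $\text{logit}(p_{2})=\beta+\psi/2$ . The data values and all function and argument names are hypothetical, and the code is not the implementation used in the abtest package.

```r
# Hypothetical data: y successes out of n trials in each group.
dat <- list(y1 = 20, n1 = 50, y2 = 28, n2 = 50)

# Unnormalized log posterior as a function of par = c(beta, psi).
log_post <- function(par, dat, mu_beta = 0, sigma_beta = 1,
                     mu_psi = 0, sigma_psi = 1) {
  eta1 <- par[1] - par[2] / 2  # log odds in group 1
  eta2 <- par[1] + par[2] / 2  # log odds in group 2
  dat$y1 * eta1 - dat$n1 * log1p(exp(eta1)) +
    dat$y2 * eta2 - dat$n2 * log1p(exp(eta2)) +
    dnorm(par[1], mu_beta, sigma_beta, log = TRUE) +
    dnorm(par[2], mu_psi, sigma_psi, log = TRUE)
}

# Analytic gradient with respect to beta and psi.
grad_log_post <- function(par, dat, mu_beta = 0, sigma_beta = 1,
                          mu_psi = 0, sigma_psi = 1) {
  r1 <- dat$y1 - dat$n1 * plogis(par[1] - par[2] / 2)
  r2 <- dat$y2 - dat$n2 * plogis(par[1] + par[2] / 2)
  c(r1 + r2 - (par[1] - mu_beta) / sigma_beta^2,
    (r2 - r1) / 2 - (par[2] - mu_psi) / sigma_psi^2)
}

# Find the mode with BFGS; fnscale = -1 turns optim() into a maximizer.
fit <- optim(par = c(0, 0), fn = log_post, gr = grad_log_post, dat = dat,
             method = "BFGS", control = list(fnscale = -1, reltol = 1e-12))
mode_beta_psi <- fit$par
```

Supplying `gr` spares the optimizer from approximating the gradient by finite differences, which is the speed and accuracy benefit described above.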

The first-order derivative for $l_{0}^{\ast}(\beta)$ is given by:

[TABLE]

The first-order partial derivatives for $l^{\ast}(\beta,\psi)$ are given by

[TABLE]

and

[TABLE]

The first-order partial derivatives for $l_{+}^{\ast}(\beta,\xi)$ are given by:

[TABLE]

and

[TABLE]

The first-order partial derivatives for $l_{-}^{\ast}(\beta,\xi)$ are given by:

[TABLE]

and

[TABLE]

C.2 Second-order derivatives

For the Laplace approximations, we also need the inverse of the negative Hessians. The Hessian is the matrix of second-order partial derivatives, so we now present expressions for these derivatives. Under every hypothesis there are either one or two parameters; hence, the Hessians are at most 2 by 2 matrices. For matrices of this size, the inverse and the determinant are straightforward to compute, so the quantities needed for the Laplace approximations follow easily once the required derivatives are available.

For $l_{0}^{\ast}(\beta)$ , there is only one parameter and the second-order derivative is given by:

[TABLE]

For $l^{\ast}(\beta,\psi)$ the second-order partial derivatives are given by

[TABLE]

and

[TABLE]

and

[TABLE]

For $l_{+}^{\ast}(\beta,\xi)$ the second-order partial derivatives are given by

[TABLE]

and

[TABLE]

and

[TABLE]

For $l_{-}^{\ast}(\beta,\xi)$ the second-order partial derivatives are given by

[TABLE]

and

[TABLE]

and

[TABLE]

C.3 Hessians

Having derived the relevant second-order partial derivatives, we can simply build the Hessian matrices of interest by inserting the relevant expressions. Next, we present symbolically the Hessians of interest, that is, we show which of the second-order partial derivatives need to be inserted where. Note that we omit the one for $\mathcal{H}_{0}$ since this is a single number which is simply the second-order derivative of $l_{0}^{\ast}(\beta)$ .

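The Hessian for $\mathcal{H}_{1}$ is given by:

$$\begin{pmatrix}\frac{\partial^{2}}{\partial\beta^{2}}l^{\ast}(\beta,\psi)&\frac{\partial^{2}}{\partial\beta\,\partial\psi}l^{\ast}(\beta,\psi)\\ \frac{\partial^{2}}{\partial\beta\,\partial\psi}l^{\ast}(\beta,\psi)&\frac{\partial^{2}}{\partial\psi^{2}}l^{\ast}(\beta,\psi)\end{pmatrix}$$

The Hessian for $\mathcal{H}_{+}$ is given by:

$$\begin{pmatrix}\frac{\partial^{2}}{\partial\beta^{2}}l_{+}^{\ast}(\beta,\xi)&\frac{\partial^{2}}{\partial\beta\,\partial\xi}l_{+}^{\ast}(\beta,\xi)\\ \frac{\partial^{2}}{\partial\beta\,\partial\xi}l_{+}^{\ast}(\beta,\xi)&\frac{\partial^{2}}{\partial\xi^{2}}l_{+}^{\ast}(\beta,\xi)\end{pmatrix}$$

The Hessian for $\mathcal{H}_{-}$ is given by:

$$\begin{pmatrix}\frac{\partial^{2}}{\partial\beta^{2}}l_{-}^{\ast}(\beta,\xi)&\frac{\partial^{2}}{\partial\beta\,\partial\xi}l_{-}^{\ast}(\beta,\xi)\\ \frac{\partial^{2}}{\partial\beta\,\partial\xi}l_{-}^{\ast}(\beta,\xi)&\frac{\partial^{2}}{\partial\xi^{2}}l_{-}^{\ast}(\beta,\xi)\end{pmatrix}$$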

C.3.1 Computing the inverse of the negative Hessians

Note that computing the inverses of the 2 by 2 negative Hessians is straightforward: We simply need to attach minus signs to each element of the Hessians and then make use of the fact that the inverse of a 2 by 2 matrix $\bm{A}=\begin{pmatrix}a&b\\ c&d\end{pmatrix}$ is given by $\bm{A}^{-1}=\frac{1}{\det\left(\bm{A}\right)}\begin{pmatrix}d&-b\\ -c&a\end{pmatrix}$ , where $\det\left(\bm{A}\right)=ad-bc$ .
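The 2 by 2 inversion rule stated above can be written as a short R helper. This is a sketch with a hypothetical function name and a made-up Hessian, checked against R's general-purpose `solve()`.

```r
# Inverse of a 2 by 2 matrix via the closed-form rule:
# swap the diagonal, negate the off-diagonal, divide by the determinant.
inv2x2 <- function(A) {
  det_A <- A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]
  stopifnot(det_A != 0)
  # matrix() fills column by column: first column (d, -c), second column (-b, a).
  matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) / det_A
}

# Hypothetical Hessian evaluated at a mode (negative definite at a maximum).
hessian <- matrix(c(-30, 2, 2, -12), nrow = 2)

# Attach minus signs to each element, then invert.
neg_hessian_inv <- inv2x2(-hessian)
```

At a maximum the negative Hessian is positive definite, so its inverse can serve as the covariance matrix of the normal approximation.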

Appendix D Example 1: effectiveness of resilience training (default analysis)

Here we present the results for the resilience training example obtained using the default prior setting.

D.1 Prior specification

We use the default prior setting in the abtest package that assigns both $\beta$ and $\psi$ standard normal prior distributions. The implied prior on the absolute risk can be visualized as follows:

R> library("abtest")
R> plot_prior(what = "arisk")

The resulting graph is shown in the top panel of Figure 8.

The user can also visualize the (implied) prior for other quantities. For instance, the prior on the log odds ratio (middle panel of Figure 8) is obtained as follows:

R> plot_prior(what = "logor")

The implied prior on the success probabilities $p_{1}$ and $p_{2}$ (bottom panel of Figure 8) is obtained as follows:

R> plot_prior(what = "p1p2")

The bottom panel of Figure 8 illustrates that there is a dependency between $p_{1}$ and $p_{2}$ , which is arguably desirable (Howard, 1998): When one of the success probabilities is very large (small), it is likely that the other one will also be large (small).

D.2 Hypothesis testing

The ab_test function can be used to conduct a Bayesian A/B test using the default prior setting as follows:

R> data("seqdata")
R> set.seed(1)
R> ab_default <- ab_test(data = seqdata)

This yields the following output:

R> print(ab_default)

Bayesian A/B Test Results:

 Bayes Factors:

 BF10: 0.2767214
 BF+0: 0.4890489
 BF-0: 0.05778357

 Prior Probabilities Hypotheses:

 H+: 0.25
 H-: 0.25
 H0: 0.5

 Posterior Probabilities Hypotheses:

 H+: 0.192
 H-: 0.0227
 H0: 0.7853

The first part of the output presents Bayes factors in favor of the hypotheses $\mathcal{H}_{1}$ , $\mathcal{H}_{+}$ , and $\mathcal{H}_{-}$ , where the reference hypothesis (i.e., denominator of the Bayes factor) is $\mathcal{H}_{0}$ . Since all three Bayes factors are smaller than 1, they all indicate evidence in favor of the null hypothesis of no effect. The next part of the output displays the prior probabilities of the hypotheses with non-zero prior probability. The final part of the output displays the posterior probabilities of the hypotheses with non-zero prior probability. The posterior probability of the null hypothesis $\mathcal{H}_{0}$ indicates that the data have increased the plausibility of the null hypothesis from $.50$ to $.79$ . Furthermore, the data have decreased the plausibility of both $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$ .
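The link between the three parts of the output can be made explicit with a few lines of base R. The sketch below applies Bayes' rule to the Bayes factors and prior probabilities printed above, using $\mathcal{H}_{0}$ as the reference hypothesis (so its Bayes factor against itself is 1); the variable names are chosen here for illustration.

```r
# Bayes factors relative to H0, as printed in the output above.
bf_vs_null <- c("H+" = 0.4890489, "H-" = 0.05778357, "H0" = 1)

# Prior probabilities of the hypotheses with non-zero prior probability.
prior_prob <- c("H+" = 0.25, "H-" = 0.25, "H0" = 0.50)

# Posterior probabilities: prior times Bayes factor, renormalized to sum to 1.
unnormalized <- prior_prob * bf_vs_null
posterior_prob <- unnormalized / sum(unnormalized)
round(posterior_prob, 4)

# Evidence for H0 over H+ is the reciprocal of BF+0.
bf_null_vs_plus <- 1 / bf_vs_null[["H+"]]
```

Rounded to four decimals, this reproduces the posterior probabilities .1920, .0227, and .7853 shown in the output.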

The abtest package allows users to visualize the posterior probabilities of the hypotheses by means of a probability wheel (Figure 9):

R> prob_wheel(ab_default)

Overall, the data support the hypothesis that the training is ineffective over the hypothesis that the training has a positive effect. The Bayes factor for $\mathcal{H}_{0}$ over $\mathcal{H}_{+}$ equals $1/0.489\approx 2.04$ ; however, this indicates only anecdotal evidence (Jeffreys, 1939, Appendix I).

Since the data set is of a sequential nature, it may be of interest not only to consider the result based on all observations but also to conduct a sequential analysis that tracks the evidential flow as a function of the total number of observations (i.e., the number of observations across both groups). This sequential analysis can be conducted as follows:

R> plot_sequential(ab_default, thin = 4)

Figure 10 displays the result of the sequential analysis. The sequential analysis indicates that after some initial fluctuation, adding more observations increased the probability of the null hypothesis that there is no effect of the training.

D.3 Parameter estimation

The data indicate only anecdotal evidence in favor of the null hypothesis versus the hypothesis that the training is effective, leaving open the possibility that the training does have an effect. To assess this possibility one may investigate the potential size of the effect under the assumption that the effect is non-zero. For parameter estimation, we generally prefer to investigate the posterior distribution for the unconstrained alternative hypothesis $\mathcal{H}_{1}$ .

The top panel of Figure 11 displays the posterior distribution for the absolute risk (i.e., $p_{2}-p_{1}$ ) that can be obtained as follows:

R> plot_posterior(ab_default, what = "arisk")

The top panel of Figure 11 shows the prior distribution as a dotted line and the posterior distribution (with 95% central credible interval) as a solid line. The plot indicates that, under the assumption that the difference between the two success probabilities is not exactly zero, the posterior median is $0.039$ and the 95% central credible interval ranges from $-0.022$ to $0.101$ .

The middle panel of Figure 11 displays the posterior distribution for the log odds ratio $\psi$ that can be obtained as follows:

R> plot_posterior(ab_default, what = "logor")

The middle panel of Figure 11 indicates that, given the log odds ratio is not exactly zero, it is likely to be between $-0.089$ and $0.406$ , where the posterior median is $0.159$ .

It may also be of interest to consider the marginal posterior distributions of the success probabilities $p_{1}$ and $p_{2}$ . This plot can be produced as follows:

R> plot_posterior(ab_default, what = "p1p2")

The bottom panel of Figure 11 displays the resulting plot. In this example, $p_{1}$ and $p_{2}$ correspond to the probability of still being on the job after six months for the non-trained employees and the employees that received the training, respectively. The bottom panel of Figure 11 indicates that the posterior median for $p_{1}$ is $0.498$ , with 95% credible interval ranging from $0.455$ to $0.542$ , and the posterior median for $p_{2}$ is $0.537$ , with 95% credible interval ranging from $0.494$ to $0.581$ .

In sum, based on a default prior analysis, this fictitious data set offers anecdotal evidence in favor of the null hypothesis which states that the training is not effective over the hypothesis that the training is effective; the consultancy firm should probably continue to collect data in order to obtain more compelling evidence before deciding whether or not the training should be implemented. If the true effect is as small as 4%, continued testing will ultimately show compelling evidence for $\mathcal{H}_{+}$ over $\mathcal{H}_{0}$ . Note that continued testing is trivial in the Bayesian framework: the results can simply be updated as new observations arrive.

Bibliography

Armitage P (1960). Sequential Medical Trials. Thomas, Springfield (IL).
Azevedo EM, Alex D, Montiel Olea J, Rao JM, Weyl EG (2019). “A/B Testing with Fat Tails.” SSRN. URL http://dx.doi.org/10.2139/ssrn.3171224.
Bååth R (2014). “Bayesian First Aid: A Package that Implements Bayesian Alternatives to the Classical *.test Functions in R.” In Use R! 2014 - the International R User Conference.
Bartlett MS (1957). “A Comment on D. V. Lindley’s Statistical Paradox.” Biometrika, 44, 533–534.
Berger JO, Delampady M (1987). “Testing Precise Hypotheses.” Statistical Science, 2, 317–352.
Berger JO, Wolpert RL (1988). The Likelihood Principle (2nd ed.). Institute of Mathematical Statistics, Hayward (CA).
Berman R, Pekelis L, Scott A, Van den Bulte C (2018). “p-Hacking and False Discovery in A/B Testing.” SSRN. URL http://dx.doi.org/10.2139/ssrn.3204791.
Bürkner PC (2017). “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software, 80, 1–28.
