Bayesian Additive Regression Trees for Group Testing Data

Madeleine E. St. Ville; Christopher S. McMahan; Joe D. Bible; Joshua M. Tebbs; Christopher R. Bilder

PMC · DOI:10.1002/sim.70052·March 14, 2025

Bayesian Additive Regression Trees for Group Testing Data

Madeleine E. St. Ville, Christopher S. McMahan, Joe D. Bible, Joshua M. Tebbs, Christopher R. Bilder

PDF

Open Access

TL;DR

This paper introduces a flexible Bayesian method to estimate disease probabilities using group testing data, even when test results may be inaccurate.

Contribution

The novelty is using Bayesian additive regression trees to model unknown covariate effects in group testing without assuming a fixed functional form.

Findings

01

The proposed method can estimate individual disease probabilities without knowing true statuses.

02

It handles misclassification and works with any group testing protocol.

03

The approach avoids model misspecification by allowing flexible covariate effects.

Abstract

When screening for low‐prevalence diseases, pooling specimens (e.g., blood, urine, swabs, etc.) through group testing has the potential to substantially reduce costs when compared to testing specimens individually. A common goal in group testing applications is to estimate the relationship between an individual's true disease status and their individual‐level covariate information. However, estimating such a relationship is a non‐trivial problem because true individual disease statuses are unknown due to the group testing protocol and the possibility of imperfect testing. While several regression methods have been developed in recent years to accommodate the complexity of group testing data, the functional form of covariate effects is typically assumed to be known. To avoid model misspecification and to provide a more flexible approach, we propose a Bayesian additive regression trees…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Chlamydia(genus)

Chemicals1

MPT

Diseases12

STD BART HIV chlamydia SARS-CoV-2 influenza A/B infectious disease infection Zika gonorrhea chlamydial tuberculosis

Figures4

Click any figure to enlarge with its caption.

Sum of regression trees illustration using two trees. Each tree uses two predictors. The tree on the left uses xi1 and xi2. The tree on the right uses xi2 and xi3. Each tree has three terminal nodes.

Averaged estimates of f(·) for model M3 under individual testing (IT), master pool testing (MPT), and Dorfman testing (DT). The true curve is shown in black. The solid red curve depicts the averaged posterior mean estimate based on 500 Monte Carlo data sets. The dotted curves depict 0.025 and 0.975 quantiles of the posterior mean estimates. Results for BART (with K=20 and K=200 trees) and the GLM fit [16] are shown. The same figure for model M1 is given in Appendix D of the Supporting Information.

Posterior estimates of the probability of chlamydial disease as a function of age for 32 cohorts. Top row: Informative priors for AC2A accuracy probabilities. Bottom row: Non‐informative priors. BART results are shown for K=20 and K=200 trees. Results from the GLM fit [16] are also shown.

Top: The solid red curve denotes posterior estimates of the probability of chlamydial disease for non‐Caucasian females who reported having a new sexual partner (xi2=0, xi3=1). The solid blue curve denotes the same estimates for Caucasian females (xi2=1, xi3=1). Bottom: Difference in the estimated probabilities for the two cohorts in the top row. Pointwise 95% equal‐tail credible bands are shown dotted. These figures were constructed by assuming informative priors for the AC2A accuracy probabilities and K=200 trees. The same figure under non‐informative priors is shown in Web Appendix F of the Supporting Information.

Tables2

TABLE 1. Terminal nodes assigned to five individuals (hypothetical data) for the K=2 regression trees in Figure 1. Values of ∑k=12g(xi;Tk,Mk) are also shown.

$i$	$x_{i 1}$	$x_{i 2}$	$x_{i 3}$	$g (x_{i}; T_{1}, M_{1})$	$g (x_{i}; T_{2}, M_{2})$	$\sum_{k = 1}^{2} g (x_{i}; T_{k}, M_{k})$
1	56	110	−13	$μ_{31}$	$μ_{12}$	$μ_{31} + μ_{12}$
2	27	173	−3	$μ_{21}$	$μ_{32}$	$μ_{21} + μ_{32}$
3	41	94	5	$μ_{11}$	$μ_{22}$	$μ_{11} + μ_{22}$
4	30	213	−9	$μ_{21}$	$μ_{12}$	$μ_{21} + μ_{12}$
5	48	168	39	$μ_{31}$	$μ_{32}$	$μ_{31} + μ_{32}$

TABLE 2. Iowa chlamydia data. In‐ and out‐of‐sample log‐likelihood calculated using posterior mean estimates. Results are shown for informative and non‐informative priors for the AC2A accuracy probabilities. Results from the GLM fit [16].

AC2A priors		BART ( $K = 20$ )	BART ( $K = 200$ )	GLM
Informative	In‐Sample	−3329.85	−3320.62	−4379.05
Informative	Out‐of‐Sample	−595.75	−594.95	−802.07
Non‐informative	In‐Sample	−3334.01	−3326.15	−3446.61
Non‐informative	Out‐of‐Sample	−595.98	−597.44	−616.18

Funding2

—National Institutes of Health 10.13039/100000002
—National Science Foundation 10.13039/100000001

Keywords

decision treeslatent variable modelingmachine learningnon‐parametric regressionpooled testing

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSARS-CoV-2 detection and testing · Data-Driven Disease Surveillance · Biosensors and Analytical Detection

Full text

Introduction

1

When screening for infectious diseases, group testing has become a popular alternative to individual testing due to its cost‐effectiveness and efficiency. Fundamentally, in group testing one combines individual specimens (e.g., blood, urine, swabs, etc.) to form a pooled specimen that is tested for the presence of disease. In many group testing protocols, individuals contributing to a pooled specimen that tests negatively are classified as negative at the expense of just a single diagnostic assay; in contrast, positive pools are resolved through further testing. The infectious disease literature is replete with applications of group testing being used to detect HIV [1, 2], gonorrhea and chlamydia [3], influenza A/B [4], Zika [5], tuberculosis [6], and SARS‐CoV‐2 [7, 8].

A common task in disease surveillance is estimating the probability of disease for individuals and identifying associated risk factors. However, accurately estimating the relationship between an individual's disease status and their covariates from group testing data is a non‐trivial problem; the true individual responses are obscured by the group testing protocol, and testing responses are potentially misclassified due to imperfect testing. These complications notwithstanding, there has been substantial growth in the development of regression methods for group testing data. Existing research includes parametric approaches by Vansteelandt et al. [9], Huang and Tebbs [10], Chen et al. [11], and Delaigle and Tan [12], as well as non‐parametric approaches by Delaigle and Meister [13], Delaigle et al. [14], and Delaigle and Hall [15]. McMahan et al. [16] recently proposed a Bayesian approach within a generalized linear model (GLM) framework that boasts three strengths; namely, one can analyze data arising from any group testing protocol to include retesting information, one can incorporate historical information about disease prevalence, and the framework allows for the estimation of assay accuracy probabilities. McMahan et al. [16] also motivated the development of several modeling extensions [17, 18]. However, when viewed collectively, a limitation of many existing methods is that they require one to specify the functional form of the relationship between disease status and multiple covariates. This can be challenging to do when the relationship is complex. In particular, non‐linear and/or high‐order interaction effects are potentially ignored which can result in model misspecification and biased inference.

In this article, we propose a Bayesian additive regression trees (BART) modeling framework for group testing data. BART is an ensemble machine learning technique and is well‐equipped to handle large complex group testing data sets. It builds a series of decision trees that partition the input space into regions and make predictions based on covariate values within each region. It automatically captures complex high‐order interaction effects without requiring the researcher to specify anything about the functional form. Furthermore, BART extends the ensemble of decision trees by incorporating Bayesian modeling to quantify uncertainty in parameter estimates and regularize the fit. Our approach preserves the core strengths of McMahan et al. [16] while allowing for more flexible modeling. In turn, this provides a more accurate estimation and a better understanding of the relationship between disease status and covariates.

The remainder of this article is organized as follows. In Section 2, we introduce our proposed BART model and discuss assumptions. In Section 3, we describe the data augmentation steps that facilitate a Bayesian framework and provide details on posterior sampling. In Section 4, we present the results of simulation studies to assess performance under a variety of settings for different group testing protocols. In Section 5, we use our methods to analyze a chlamydia group testing data set collected at a public health laboratory in Iowa. In Section 6, we conclude with a summary discussion. Additional technical details, simulation evidence, and data analysis results are shown in the Supporting Information.

Notation and Model

2

Suppose a group testing protocol is used to test N individuals for a binary characteristic, such as the presence or absence of a disease. Let Y˜i, for i=1,…,N, denote the true disease status of the ith individual, with the usual convention that Y˜i=1 if the individual is truly positive, Y˜i=0 otherwise. Let xi=(xi1,…,xiQ)′ denote a vector of Q covariates observed for the ith individual. We aggregate the individuals' true disease statuses into the vector Y˜=(Y˜1,…,Y˜N)′ and their covariates into the matrix X=(x1⋯xN). For modeling purposes, we assume the individuals' true disease statuses are conditionally independent given their individual‐level covariates and that the relationship between Y˜i and xi is

[eqn]

where Φ−1(·) is the inverse cumulative distribution function of a standard normal random variable (i.e., the probit link function) and f(·) is an unknown function, regarded as an infinite‐dimensional parameter. To reduce its dimension while also maintaining adequate modeling flexibility, we approximate f(·) by using additive regression trees; i.e., we approximate f(·) by an ensemble of K regression trees in the following manner:

[eqn]

where Tk is the kth regression tree structure consisting of a set of interior node decision rules and a set of bk terminal nodes and Mk=(μk1,…,μkbk)′ is a bk‐dimensional vector of terminal node parameters. The function g(xi;Tk,Mk) returns the value μkt∈Mk if xi is assigned to the tth terminal node based on the interior node decision rules of Tk. The decision rules provide information regarding which covariate to split on and the associated cutoff value. These are binary splits based on a single covariate and are of the form {xiq≤c} vs. {xiq>c} for a cutoff value c. Taken together, g(xi;Tk,Mk) can be viewed as a multi‐dimensional step function that can aptly account for many features, including non‐linear effects and interactions of varying orders. Note that in (2), K is the (typically fixed) number of regression trees. Setting K to be large is recommended for flexible estimation, for example, Chipman et al. [19] showed the default K=200 yields good predictive performance.

For readers unfamiliar with the sum‐of‐trees model, consider the following simple example with an ensemble of K=2 trees and Q=3 covariates. Suppose we are given the two trees in Figure 1. Each tree uses two predictors to split the data into subgroups; the first tree (k=1) on the left in Figure 1 uses xi1 and xi2, while the second tree (k=2) on the right uses xi2 and xi3. For each tree, each value of xi is assigned to a single terminal node by following a sequence of decision rules at each interior node from top to bottom, where it is finally assigned a parameter value associated with that terminal node. For the hypothetical data on 5 individuals given in Table 1, one can see the quantity being “summed” in the final model for the ith individual is the terminal node parameter each tree structure assigns to this individual. Tan and Roy [20] provide additional regression tree examples in their excellent expository review of BART and its associated models.

Sum of regression trees illustration using two trees. Each tree uses two predictors. The tree on the left uses xi1 and xi2. The tree on the right uses xi2 and xi3. Each tree has three terminal nodes.

BART casts the sum‐of‐trees model into a Bayesian paradigm and controls the size and effect of individual trees by imposing regularization priors [19]. Of course, if individuals' true disease statuses were observed, one could estimate the BART model using existing R packages; e.g., BayesTree [19], bartMachine [21], BART [22], etc. However, due to the effects of both pooling and imperfect testing, the individuals' true disease statuses are unobserved in group testing. Instead, the observed data available for model fitting consist of possibly error‐contaminated test responses that are measured on pools and/or individuals according to the group testing protocol in use. Further complicating the data structure, many protocols require individuals to be tested in multiple pools including pools which may overlap with others [1, 23].

To maintain generality and to accommodate data from any group testing protocol, we track pool membership through the index sets 𝒫j⊂{1,2,…,N}, where 𝒫j consists of the indices of individuals who contributed to the jth pool, j=1,…,J. Let Zj denote the test outcome from assaying the jth pool, where Zj=1 if the pool tests positively and Zj=0 otherwise. The sensitivity and specificity of the assay used to test the jth pool are Sej=P(Zj=1|Z˜j=1) and Spj=P(Zj=0|Z˜j=0), respectively, where Z˜j is the true disease status of the jth pool, that is, Z˜j=I(∑i∈𝒫jY˜i>0), where I(·) is the indicator function. In some settings, it may be reasonable to assume Sej and Spj are known a priori. If not, one can regard these as unknown parameters and estimate them following the approach outlined in McMahan et al. [16]. That is, one can divide the test outcomes into L strata created by factors related to assay characteristics and performance (e.g., pool size, type of assay used, specimen type, etc.) and define the index set ℳ(l)={j:thejth test outcome is in thelth stratum}. We assume assay accuracy probabilities are constant within each stratum but potentially vary among the L strata. Define Se(l) and Sp(l) to be the sensitivity and specificity of the assay associated with the lth stratum, respectively, that is, Sej=Se(l) and Spj=Sp(l) if and only if j∈ℳ(l). Examples of how strata might be created in practice are shown in Sections 4 and 5.

The conditional distribution of Z=(Z1,…,ZJ)′ can be expressed as

[eqn]

where Se=(Se(1),…,Se(L))′, Sp=(Sp(1),…,Sp(L))′, T is the collection of regression trees T1,…,TK, M=(M1′,…,MK′)′, and ηi=η(xi). To derive (3), we assume the observed testing responses in Z are conditionally independent given their true statuses Z˜=(Z˜1,…,Z˜J)′ and Z|Z˜ does not depend on the covariates in X. These assumptions are common in the group testing literature for regression; see, e.g., Vansteelandt et al. [9] and Xie [24], and are reasonable when misclassification is driven primarily by factors related to test implementation. Evaluating the data model in (3) requires taking the sum over {0,1}N, which denotes the collection of all 2N possible realizations of Y˜. For this reason, evaluating (3) can be computationally burdensome at best and completely intractable at worst. We make use of a data augmentation strategy, described in Section 3.1, to develop a posterior sampling algorithm that circumvents the need to directly evaluate this data model. This posterior algorithm operates in the same way regardless of which group testing protocol is used to collect the data.

To complete our Bayesian model, we specify priors for each of the unknown model parameters; i.e., the parameters governing the sum‐of‐trees model in (2) and the assay accuracy probabilities. Recall the sum‐of‐trees model is determined by K trees and the corresponding terminal node parameters, that is, (T1,M1),…,(TK,MK). Thus, we must impose priors on the kth tree structure, Tk, and the terminal node parameters given this kth tree structure, Mk|Tk, for each k=1,…,K. Assuming (T1,M1),…,(TK,MK) are independent, we can write the prior distribution as

[eqn]

noting the last equality follows from the assumption that terminal node parameters are conditionally independent given their tree structure. To elicit the priors π(Tk) and π(μkt|Tk), we follow the approach in Chipman et al. [19], which we now describe. We first specify π(Tk), the prior distribution on the kth tree structure, based on three probabilistic rules that control the size (i.e., number of terminal nodes) of the tree, the variables on which to split, and the locations of the split. The size of the tree is based on the depth of the terminal nodes, where a node at depth d∈{0,1,2,…} is non‐terminal (i.e., an interior node) with probability α(1+d)−β, where α∈(0,1) and β∈[0,∞). Default values of the hyperparameters recommended by Chipman et al. [19], α=0.95 and β=2, are used. These values tend to a priori favor smaller trees, for example, trees having 2 to 3 terminal nodes. For non‐terminal nodes, the variable on which to split is randomly selected from the set of available covariates, and the location of the split given the selected variable is sampled at random from the set of observed values for that variable. The prior for the terminal node parameters is μkt∼N(0,σμ2), where σμ=3.0/(HK). Chipman et al. [19] recommend H=2 as the default hyperparameter value, which we also adopt. The aim of this prior is to provide model regularization; it has the ability to shrink terminal node parameters thereby limiting the effect of the individual tree components. Finally, to acknowledge uncertainty in the assay accuracy probabilities, we specify independent beta priors Se(l)∼beta(ae(l),be(l)) and Sp(l)∼beta(ap(l),bp(l)), for l=1,…,L. When historical information about assay performance is available (e.g., from pilot studies used to validate the assay, etc.), one can incorporate it into the model by choosing hyperparameter values to reflect this prior knowledge. In the absence of any prior information, one can select ae(l)=be(l)=ap(l)=bp(l)=1 corresponding to uniform priors.

Posterior Inference

3

Data Augmentation

3.1

To facilitate the development of an efficient posterior sampling algorithm and to avoid direct evaluation of (3), we propose a two‐stage data augmentation procedure. In the first stage, we introduce the individuals' true disease statuses Y˜i as latent random variables. This leads to the following joint conditional distribution:

[eqn]

Making use of the fact our model employs the probit link function, the second stage introduces a carefully constructed latent random variable ωi for each i=1,…,N. These random variables independently follow a standard normal distribution such that ωi>0 if Y˜i=1 and ωi≤0 if Y˜i=0; see Albert and Chib [25]. This stage yields the following augmented likelihood:

[eqn]

where ω=(ω1,…,ωN)′ and ϕ(·) denotes the standard normal probability density function. These two stages of augmentation, together with our prior specifications, allow for the construction of a full Gibbs sampling algorithm for posterior inference.

Posterior Sampling

3.2

We briefly describe the full conditional distributions used in our posterior sampling algorithm. A complete description of the algorithm is in Appendix A of the Supporting Information. From (4), one can show the full conditional of Y˜i is Y˜i|Z,Y˜−i,Se,Sp,T,M∼Bernoulli{pi1∗/(pi0∗+pi1∗)}, where Y˜−i is the vector Y˜ with the ith element removed,

[eqn]

sij=∑i′∈𝒫j:i′≠iY˜i′, and ℐi(l)={j∈ℳ(l):i∈𝒫j}. The full conditional of ωi is truncated normal, where truncation depends on the ith latent disease status Y˜i, specifically,

[eqn]

for i=1,…,N. The notation TN{μ,σ2,(a,b)} identifies a truncated normal distribution with mean μ, variance σ2, and support over the interval (a,b). Sampling the sum‐of‐trees model parameters is possible given the latent variables and the form of the augmented likelihood in (4). We follow the Bayesian back fitting algorithm of Chipman et al. [19] to sample all parameters associated with the regression trees. Appendices B and C of the Supporting Information contain details and posteriors for the sum‐of‐trees model parameters. Finally, full conditional distributions of the assay accuracy probabilities are also beta, namely,

[eqn]

for l=1,…,L, where ae(l)∗=ae(l)+∑j∈ℳ(l)ZjZ˜j, be(l)∗=be(l)+∑j∈ℳ(l)(1−Zj)Z˜j, ap(l)∗=ap(l)+∑j∈ℳ(l)(1−Zj)(1−Z˜j), and bp(l)∗=bp(l)+∑j∈ℳ(l)Zj(1−Z˜j).

Variable Importance

3.3

After a suitable burn‐in period, our posterior algorithm returns S posterior samples of the K regression tree structures. A byproduct of BART is its ability to provide a measure of variable importance, following the approach of Chipman et al. [19]. Within a posterior sample, we can compute the proportion of times a particular covariate is used as a splitting variable among all decision rules in the ensemble of K trees. We can then estimate the variable inclusion proportion for this covariate as the mean of these proportions across the S posterior samples. In particular, let zqs be the number of decision rules that use the qth covariate as the splitting variable in the sth posterior draw of the sum‐of‐trees model and let z.s=∑q=1Qzqs be the total number of decision rules in the sth posterior sample. We define

[eqn]

to be the variable inclusion proportion for the qth covariate. Intuitively, covariates with large proportions vq are identified as influential predictors of disease status. Therefore, covariates can be ranked on the basis of their relative importance in terms of prediction. This strategy is more effective when the number of trees K is small because predictors will be forced to compete with each other to improve the fit [19]. Although this measure is widely used, it does have limitations resulting from a tendency of BART to overfit noise [26]. Thus, caution is warranted when using inclusion proportions to select a subset of important predictors.

Simulation Evidence

4

Simulation Description

4.1

This section summarizes numerical studies that evaluate the performance of BART for group testing data. We considered three population‐level models, all of which follow the form of (1), with xi=(xi1,xi2,xi3)′ as a vector of Q=3 covariates, xi1 and xi2 are uniform (0,10), and xi3∼Bernoulli(0.5). The true models are

[eqn]

where β=(β0,β1,β2,β3)′=(−0.85,0.55,−1.25,−0.35)′. All three models provide a population‐level prevalence similar to the data application in Section 5. The first model (M1) was chosen to evaluate BART's performance when the data exhibit a highly non‐linear pattern. The second (M2) allows one to assess the potential loss from a BART fit when a GLM is appropriate. The third (M3) allows for non‐linear interactions emulating patterns found in the motivating data. For each model, we generated N=5000 individual true disease statuses according to Y˜i∼Bernoulli{Φ(f(xi)}. We then repeated this process 500 times independently to create 500 individual‐level data sets.

Group testing outcomes were generated for two protocols: Master pool testing (MPT) and Dorfman testing [27] (DT). Under MPT, individuals are assigned to exactly one master pool which is tested but no further testing is performed regardless of the outcome. Therefore, although it is not possible to diagnose all individuals as positive or not under MPT, this protocol can still be used to estimate the relationship between disease status and covariates. On the other hand, DT is a two‐stage protocol where positive pools from the first stage (under MPT) are resolved by individually retesting all subjects in the pools. For both protocols, we randomly assigned individuals to master pools of size 4, and the testing response for the jth pool Zj was generated as Zj|Z˜j∼Bernoulli{SejZ˜j+(1−Spj)}, where Z˜j=I(∑i∈𝒫jY˜i>0). We considered two configurations for the assay accuracy probabilities:

C1. Sej=0.95 and Spj=0.98, assumed to be known and the same for all J pools
C2.(for DT only) Se(1)=0.95 and Sp(1)=0.98 are assumed for master pools in the first stage; Se(2)=0.98 and Sp(2)=0.99 are assumed for individual tests in the second stage. These four parameters are regarded to be unknown and are estimated concurrently with the population‐level model.

Our methods are evaluated under two BART configurations: One with K=20 trees and one with K=200 trees. For the model parameters associated with the regression trees, we used the default prior specifications described in Section 2. When estimating the assay accuracy probabilities (C2), we elicited independent uniform priors for Se(l) and Sp(l), for l=1,2. This represents the most challenging situation for estimation because no prior information regarding assay accuracy is introduced. We used our posterior sampling algorithm to draw S=2500 samples (after a burn‐in of 2 500 samples) and assessed convergence by using standard MCMC diagnostics. For comparison purposes, we also simultaneously estimated the Bayesian GLM from McMahan et al. [16], entering the three covariates as linear terms and specifying diffuse priors for the regression coefficients. This allows us to document the potential advantages and disadvantages of BART for group testing, noting the GLM is correctly specified for model M2 and misspecified otherwise.

All estimation results are averaged over the 500 data sets. To evaluate in‐ and out‐of‐sample classification accuracy, we performed a receiver operating characteristic (ROC) curve analysis, summarized by the area under the curve (AUC). To assess out‐of‐sample accuracy for each model fit, we simulated 1 000 new individual disease statuses using the process outlined above and then used our models to predict their disease probabilities and compute the associated AUC scores. Variable inclusion proportions vq were recorded for each covariate based on the S=2500 MCMC samples. For comparison, individual testing (IT) was also implemented for the configurations that assume assay accuracy probabilities Sej and Spj are known. This comparison is always of interest in group testing as it allows one to assess the information lost from pooling.

Simulation Results

4.2

Figure 2 summarizes the averaged estimates of f(·) under the third population‐level model (M3) for IT, MPT, and DT when assay accuracy probabilities are known. This model includes the binary covariate xi3 so the true function, shown in solid black in Figure 2, is bifurcated depending on its value. Figure D.2, shown in Appendix D of the Supporting Information, provides the same summary for the first population‐level model (M1). In both figures, one notes the results for BART are similar when comparing the number of trees, K=20 and K=200, for both MPT and DT. Figure 2 shows that estimated functions from both BART configurations are in agreement with the true regression function for both group testing protocols under model M3. Estimation is inherently more difficult under model M1 as f(xi)=sin(πxi1)−1.25 is a highly non‐linear function. In fact, both the MPT and DT protocols struggle to recover the “peaks and valleys” of this function exactly (see Figure D.2), but DT does produce results similar to IT. This is noteworthy as IT uses 5 000 tests, one for each individual, while DT uses a small fraction of this number. Both Figure 2 and Figure D.2 show how disparaging the fit can be when one estimates a first‐order GLM.

Averaged estimates of f(·) for model M3 under individual testing (IT), master pool testing (MPT), and Dorfman testing (DT). The true curve is shown in black. The solid red curve depicts the averaged posterior mean estimate based on 500 Monte Carlo data sets. The dotted curves depict 0.025 and 0.975 quantiles of the posterior mean estimates. Results for BART (with K=20 and K=200 trees) and the GLM fit [16] are shown. The same figure for model M1 is given in Appendix D of the Supporting Information.

Table D.2 in Appendix D of the Supporting Information summarizes our ROC analyses for the in‐ and out‐of‐sample classification accuracy under all population‐level models when the assay accuracy probabilities are known. For both group testing protocols, BART has substantially better classification accuracy for models M1 and M3 than the incorrectly specified GLM. At the same time, BART performs just as well as the correctly specified GLM for model M2, showing that little is lost, at least in terms of classification accuracy, when estimating the more complex BART model with group testing data. AUC scores are consistently higher for DT when compared to MPT, and the choice of K=20 or K=200 trees does not produce noticeably different results. Figure D.4 in Appendix D of the Supporting Information plots variable inclusion proportions for both BART configurations when the assay accuracy probabilities are known. The only predictor used in model M1 is xi1 and BART successfully identifies it as having the largest proportion among v1, v2, and v3 for both group testing protocols, more noticeably when using K=20 trees. Similarly, the only predictor not used in model M3 is xi2, and v2 is consistently the lowest among the three proportions when this smaller number of trees is used.

Our simulation studies show the main findings summarized above remain unchanged when the assay accuracy probabilities Se(1), Sp(1), Se(2), and Sp(2) are unknown and are estimated alongside the population‐level models (see Configuration C2 in Section 4.1). This is significant because we used uniform distributions as prior models for these four probabilities. When these priors are used, there is simply not enough information to learn about assay accuracy under IT or MPT [16]. However, using DT affords one the opportunity to learn about these probabilities when confirmatory or counterfactual test results are observed in the two stages, for example, when a master pool tests positively in the first stage and one or more individuals in the pool test negatively in the second. In fact, Table D.1 in Appendix D of the Supporting Information shows all four probabilities are estimated correctly on average under DT for both BART configurations under all population‐level models. Furthermore, estimated coverage probabilities of nominal 95% equal‐tail credible intervals for these parameters are either within the margin of Monte Carlo error (±0.03) or slightly conservative except when one estimates a GLM incorrectly. Figures D.1 and D.3 show that averaged estimates of f(·) are largely unchanged when assay accuracy probabilities are estimated under models M1 and M3. Additional results in Appendix D convey the same findings for classification accuracy and variable inclusion.

IOWA Chlamydia Data

5

Data Description and Model

5.1

The State Hygienic Laboratory (SHL) at the University of Iowa is the largest public health laboratory in Iowa. Each year, the lab tests thousands of Iowa residents for chlamydia as part of sexually transmitted disease (STD) prevention programs. Individual endocervical swabs and urine specimens are collected from various clinics throughout the state which are then transported to the SHL. The current SHL screening procedure requires all urine specimens to be tested individually, while Dorfman testing (DT) is used for swab specimens. That is, swab specimens are first tested in master pools, usually of size 4, and positive pools are resolved by testing all specimens separately. The SHL uses the Aptima Combo 2 Assay (AC2A) to test all specimens, both pooled and individual. Pilot data describing the accuracy of the AC2A for individual testing are summarized in the assay's product literature and Gaydos et al. [28] We reproduce a portion of these data in Appendix E of the Supporting Information.

We analyze data for N=13862 female subjects tested at the SHL during the 2014 calendar year. These data consist of chlamydia test responses from 4 316 individual urine specimens, 416 individual swab specimens, 2 273 swab master pools of size 4, 12 swab master pools of size 3, one swab master pool of size 2, and additional retesting responses required to resolve positive swab master pools. We consider six covariates as potential risk factors: Age (in years, denoted by xi1), a race indicator for the ith individual (xi2=1 if Caucasian), an indicator denoting whether the individual reported a new sexual partner in the last 90 days (xi3=1), an indicator denoting whether the individual reported having multiple sexual partners in the last 90 days (xi4=1), an indicator denoting whether the individual reported sexual contact with an STD‐positive partner in the previous year (xi5=1), and an indicator denoting whether the individual presented symptoms of infection (xi6=1). All binary covariates were recorded as “0” if the corresponding condition was not met. To relate an individual's true chlamydia status to their available covariate information, we estimate

[eqn]

where xi=(xi1,…,xi6)′, for i=1,…,13862. We use two BART configurations, one with K=20 trees and one with K=200 trees. We also estimated a GLM from McMahan et al. [16] for comparison, entering the six covariates in a linear fashion and eliciting non‐informative priors for the regression coefficients.

Prior distributions for the parameters in the sum‐of‐trees model and the assay accuracy probabilities described in Section 2 were used in our analysis. Although lab technicians at the SHL used the AC2A to test all specimen types, it is important to acknowledge differences in how this assay might perform when testing swab and urine specimens [28] and when testing individuals and pools. With this in mind, we divide all test responses into L=3 strata with the following AC2A accuracy probabilities: Se(1) and Sp(1) for swab specimens tested individually, Se(2) and Sp(2) for urine specimens tested individually, and Se(3) and Sp(3) for swab specimens tested in pools. For beta prior elicitation, we first modeled these six parameters informatively using the AC2A pilot data in Appendix E of the Supporting Information. We performed separate analyses where these probabilities were modeled non‐informatively with uniform distributions over (0,1).

Analysis and Results

5.2

We first evaluate the predictive performance of BART on the basis of the number trees K and the prior models for the six AC2A accuracy probabilities. To do this, we randomly split the data into training and test sets, where 85% of the data was used to train the model and the remaining 15% was allocated to the test set. The true chlamydia statuses are not observed in this application due to the potential of imperfect testing. Therefore, it is not appropriate to perform a ROC curve analysis as we did in Section 4 where individual true disease statuses were simulated directly. We instead assess predictive performance through the log‐likelihood, which is analogous to assessing model‐based performance using the cross‐entropy loss function [29]. Using posterior mean parameter estimates, Table 2 reports the log‐likelihood values for the in‐ and out‐of‐sample sets under four BART configurations which arise from cross‐classifying the number of trees (K=20, K=200) and the assay prior modeling assumptions (informative, non‐informative). The BART results are nearly identical under each configuration for in‐ and out‐of‐sample data sets. Furthermore, the predictive performance of BART is superior to the GLM with linear terms only, more noticeably when informative priors for the AC2A probabilities are used.

Figure 3 shows estimated probabilities of chlamydial disease as a function of age for 25=32 risk profiles, created by forming all possible combinations of the five binary covariates (defined in Section 5.1). These estimated profiles are shown under both informative prior assumptions for the AC2A probabilities (top row) and non‐informative priors (bottom row), and we include the first‐order GLM results for comparison. The same non‐linear age effect is seen for both choices of K under BART, but the K=200 configuration provides smoother functional estimates as expected. Furthermore, estimated risk profiles that cross suggest the presence of interaction among the binary covariates and age. BART flexibly captures such interactions without having to specify the functional form of the relationship between disease status and covariates. It is interesting that BART detects a possible increase in chlamydial risk after the age of 50 years; however, it may be an overreaction to infer this for the population of Iowa females. Our data set included only 82 females aged 50 years or older (out of 13 862 total), and most of these subjects visited clinics primarily for STD services.

Posterior estimates of the probability of chlamydial disease as a function of age for 32 cohorts. Top row: Informative priors for AC2A accuracy probabilities. Bottom row: Non‐informative priors. BART results are shown for K=20 and K=200 trees. Results from the GLM fit [16] are also shown.

To further highlight the flexibility of BART when analyzing group testing data, Figure 4 (top row) shows estimated probabilities of disease as a function of age for two cohorts of subjects: 582 non‐Caucasian females who reported having a new sexual partner (xi2=0, xi3=1) and 3 199 Caucasian females who reported having a new sexual partner (xi2=1, xi3=1). The bottom row of Figure 4 plots the difference in the estimates for these two cohorts. The corresponding estimates from fitting a first‐order GLM are also shown for comparison purposes, and all subfigures include pointwise 95% equal‐tail credible bands. The two cohorts identified in Figure 4 are potentially of interest from a screening policy point of view. The United States Preventive Services Task Force [30] has recommended annual chlamydia testing for all sexually active females in the United States aged 24 years and younger. Our BART analysis produces results that are supportive of this recommendation and furthermore detects a significant difference between younger females in these two groups. Specifically, the probability of disease is significantly higher for non‐Caucasian females up until the age of approximately 22 years but is not statistically different after this age. Finally, note that Figure 4 was constructed using our BART model estimates with K=200 trees under informative prior distributions for the AC2A accuracy probabilities. The same figure under non‐informative priors reveals similar findings and is shown in Appendix F of the Supporting Information. This appendix also includes additional Iowa data analysis results, including variable importance proportions and estimates of the AC2A accuracy probabilities under both informative and non‐informative prior model assumptions.

Top: The solid red curve denotes posterior estimates of the probability of chlamydial disease for non‐Caucasian females who reported having a new sexual partner (xi2=0, xi3=1). The solid blue curve denotes the same estimates for Caucasian females (xi2=1, xi3=1). Bottom: Difference in the estimated probabilities for the two cohorts in the top row. Pointwise 95% equal‐tail credible bands are shown dotted. These figures were constructed by assuming informative priors for the AC2A accuracy probabilities and K=200 trees. The same figure under non‐informative priors is shown in Web Appendix F of the Supporting Information.

Discussion

6

We have developed a general BART regression framework for group testing data with individual‐level covariates. Our methods expand on the Bayesian approach in McMahan et al. [16] to seamlessly incorporate non‐linear main effects and potentially high‐order interaction effects while estimating assay accuracy probabilities. BART can also assess variable importance by examining relative frequencies with which covariates are used as splitting variables in the posterior samples. R code for data analysis is available on the first author's GitHub website https://github.com/mstville/bart_gt.

In our survey of the group testing literature for regression analysis, we can find no other approach as flexible as the one presented in this paper. At the same time, several modeling enhancements could be of interest, including BART and other machine learning methods for group testing data with random effects, [11, 17] pool responses which are subject to dilution [31, 32], and missing covariates [12]. Another research direction would be to develop flexible regression methods for multivariate binary outcomes arising from group testing with multiplex assays, that is, assays that test pooled specimens for multiple diseases at once. Group testing case identification protocols with multiplex assays have been proposed in Tebbs et al. [33], Hou et al. [34], and Hou et al. [35].

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Supporting Information.

Bibliography35

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1D. Westreich , M. Hudgens , S. Fiscus , and C. Pilcher , “Optimizing Screening for Acute Human Immunodeficiency Virus Infection With Pooled Nucleic Acid Amplification Tests,” Journal of Clinical Microbiology 46 (2008): 1785–1792.18353930 10.1128/JCM.00787-07PMC 2395100 · doi ↗ · pubmed ↗
2M. Krajden , D. Cook , A. Mak , et al., “Pooled Nucleic Acid Testing Increases the Diagnositc Yield of Acute HIV Infections in High‐Risk Population Compared to 3rd and 4th Generation HIV Enzyme Immunoassays,” Journal of Clinical Virology 61 (2014): 132–137.25037533 10.1016/j.jcv.2014.06.024 · doi ↗ · pubmed ↗
3J. Lewis , V. Lockary , and S. Kobic , “Cost Savings Increased Efficiency Using a Stratified Specimen Pooling Strategy for Chlamydia Trachomatis and Neisseria Gonorrhoeae ,” Sexually Transmitted Diseases 39 (2012): 46–48.22183846 10.1097/OLQ.0b 013e 318231 cd 4a · doi ↗ · pubmed ↗
4T. Van , J. Miller , D. Warshauer , et al., “Pooling Nasopharyngeal/Throat Swab Specimens to Increase Testing Capacity for Influenza Viruses by PCR,” Journal of Clinical Microbiology 50 (2012): 891–896.22205820 10.1128/JCM.05631-11PMC 3295167 · doi ↗ · pubmed ↗
5P. Saá , M. Proctor , G. Foster , et al., “Investigational Testing for Zika Virus Among US Blood Donors,” New England Journal of Medicine 378 (2018): 1778–1788.29742375 10.1056/NEJ Moa 1714977 · doi ↗ · pubmed ↗
6S. Abdurrahman , O. Mbanaso , L. Lawson , et al., “Testing Pooled Sputum With Xpert MTB/RIF for Diagnosis of Pulmonary Tuberculosis to Increase Affordability in Low‐Income Countries,” Journal of Clinical Microbiology 53 (2015): 2502–2508.26019204 10.1128/JCM.00864-15PMC 4508401 · doi ↗ · pubmed ↗
7I. Torres , E. Albert , and D. Navarro , “Pooling of Nasopharyngeal Swab Specimens for SARS‐Co V‐2 Detection by RT‐PCR,” Journal of Medical Virology 92 (2020): 2306–2307.32369202 10.1002/jmv.25971 PMC 7267454 · doi ↗ · pubmed ↗
8B. Abdalhamid , C. Bilder , E. Mc Cutchen , S. Hinrichs , S. Koepsell , and P. Iwen , “Assessment of Specimen Pooling to Conserve SARS Co V‐2 Testing Resources,” American Journal of Clinical Pathology 153 (2020): 715–718.32304208 10.1093/ajcp/aqaa 064PMC 7188150 · doi ↗ · pubmed ↗