Patient stratification in multi-arm trials: a two-stage procedure with Bayesian profile regression
Yuejia Xu, Angela M. Wood, Brian D.M. Tom

TL;DR
This paper introduces a novel two-stage Bayesian nonparametric method for patient stratification in multi-arm clinical trials, combining outcome prediction and subgroup clustering to improve personalized treatment analysis.
Contribution
It develops a new Bayesian framework that integrates outcome prediction with patient clustering and variable selection specifically for multi-arm trial settings.
Findings
Successfully identified five meaningful donor subgroups in a blood donation trial.
Demonstrated superior performance over existing methods in simulation studies.
Provides a flexible approach adaptable to various multi-arm clinical trial designs.
Abstract
Precision medicine is an emerging field that takes into account individual heterogeneity to inform better clinical practice. In clinical trials, the evaluation of treatment effect heterogeneity is an important component, and recently, many statistical methods have been proposed for stratifying patients into different subgroups based on such heterogeneity. However, the majority of existing methods developed for this purpose focus on the case with a dichotomous treatment and are not directly applicable to multi-arm trials. In this paper, we consider the problem of patient stratification in multi-arm trial settings and propose a two-stage procedure within the Bayesian nonparametric framework. Specifically, we first use Bayesian additive regression trees (BART) to predict potential outcomes (treatment responses) under different treatment options for each patient, and then we leverage…
| cluster 1 | cluster 2 | cluster 3 | |
|---|---|---|---|
| 2 | 4 | 6 | |
| 4 | 6 | 1 | |
| 5 | 1 | 3 | |
| 4 | 4 | 4 | |
| 4 | 7 | 6 | |
| 4 | 9 | 5 |
| cluster 1 | cluster 2 | cluster 3 | |
|---|---|---|---|
| 2 | 4 | 6 | |
| 4 | 6 | 1 | |
| 5 | 1 | 3 | |
| 4 | 4 | 4 | |
| 4 | 7 | 6 | |
| 4 | 9 | 5 |
| cluster 1 | cluster 2 | cluster 3 | cluster 4 | |
| 0.2 | 0.4 | 0.6 | 0.8 | |
| 0.4 | 0.4 | 0.6 | 0.6 | |
| 0.1 | 0.2 | 0.3 | 0.4 | |
| 0.15 | 0.3 | 0.15 | 0.3 | |
| 2 | 4 | 8 | 6 | |
| 2 | 2 | 2 | 3 | |
| 2 | 5 | 4 | 6 | |
| 2 | 4 | 5 | 8 | |
| 2 | 3 | 6 | 8 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods in Clinical Trials · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models
Patient stratification in multi-arm trials:
a two-stage procedure with Bayesian profile regression
Yuejia Xu1,∗
Angela M. Wood2,3
and Brian D. M. Tom1
1MRC Biostatistics Unit
University of Cambridge
Cambridge
U.K
2Cardiovascular Epidemiology Unit
University of Cambridge
Cambridge
U.K
3NIHR Blood and Transplant Research Unit in Donor Health and Genomics
Cambridge
U.K
Abstract
Precision medicine is an emerging field that takes into account individual heterogeneity to inform better clinical practice. In clinical trials, the evaluation of treatment effect heterogeneity is an important component, and recently, many statistical methods have been proposed for stratifying patients into different subgroups based on such heterogeneity. However, the majority of existing methods developed for this purpose focus on the case with a dichotomous treatment and are not directly applicable to multi-arm trials. In this paper, we consider the problem of patient stratification in multi-arm trial settings and propose a two-stage procedure within the Bayesian nonparametric framework. Specifically, we first use Bayesian additive regression trees (BART) to predict potential outcomes (treatment responses) under different treatment options for each patient, and then we leverage Bayesian profile regression to cluster patients into subgroups according to their baseline characteristics and predicted potential outcomes. We further embed a variable selection procedure into our proposed framework to identify the patient characteristics that actively “drive” the clustering structure. We conduct simulation studies to examine the performance of our proposed method and demonstrate the method by applying it to a UK-based multi-arm blood donation trial, wherein our method uncovers five clinically meaningful donor subgroups.
keywords:
Bayesian additive regression trees, Blood donation; Clustering; Patient heterogeneity; Precision medicine, Subgroup.
1 Introduction
Precision medicine is a paradigm that leverages patient heterogeneity to better inform clinical practice through data-driven, evidence-based, and scientifically-rigorous approaches, and it has gained much popularity recently partly due to the wealth of biological, clinical and epidemiological data now available. In particular, patient stratification based on patients’ treatment response profiles and baseline characteristics can greatly facilitate the understanding and characterization of the underlying patient heterogeneity, and thus can be of great interest and importance for a variety of medical applications (Sies et al., 2019).
Our research was motivated by a UK-based, multi-arm trial called INTERVAL (Moore et al., 2014), which was the first randomized trial to investigate the effect of different inter-donation intervals on blood supply and donor health. In INTERVAL, male donors were randomly assigned to 12-week (standard), 10-week, or 8-week inter-donation intervals, and female donors to 16-week (standard), 14-week, or 12-week inter-donation intervals. The outcomes of this trial include the amount of blood collected, and the number of low hemoglobin deferrals (temporary suspension of donors from giving blood) during the trial period. INTERVAL trial participants were well-characterized at baseline, providing an opportunity to investigate donor stratification and characterize different types of donors. This can be important for achieving more effective targeted recruitment of donors and increased efficiency of blood collection in the future. For example, it would be useful to stratify donors into those who have the capacity to donate more often than the current clinical practice (often referred to as “super donors” who have high donation capacity and minimal number of deferrals) and those who tend to be deferred more frequently than the average donors (often referred to as “brittle donors” who struggle more than the average donors to replace iron stores after blood donation).
A number of data-driven methods for investigating patient heterogeneity with clinical trial data have been proposed (for a review, see Sies et al. (2019)). However, to our knowledge, most of these methods were developed for two-arm trials, and they have not been extended to the situation with multi-level treatments, despite the popularity of multi-arm trials in clinical practice (Baron et al., 2013).
In this paper, we propose a method for leveraging treatment effect heterogeneity to identify patient subgroups in multi-arm trial settings from a clustering perspective. As a remark, we will use the terms “subgroup” and “cluster” interchangeably throughout this paper. In practice, the functional form of the relationship between treatment responses and covariates is typically unknown a priori . To this end, we leverage Bayesian nonparametric approaches. In particular, our proposed method is a two-stage procedure. In the first stage, we use Bayesian additive regression trees (BART) to predict each patient’s potential outcomes (treatment responses) under different treatment options. In the second stage, we employ a Bayesian mixture model called “profile regression” (Molitor et al., 2010) to cluster patients into subgroups according to their covariate profiles and predicted potential outcomes obtained in the first stage. We also extend the second stage clustering model to incorporate the variable selection feature, via which covariates that actively “drive” the clustering structure can be identified (distinguished from covariates that have similar profiles across all clusters).
Our method has several desirable properties. First, unlike conventional unsupervised clustering methods, our proposed approach allows the outcome variable to inform cluster membership, which ensures that resulting clusters are associated with the target outcome of clinical interest and are clinically meaningful. Second, our clustering model is based on the Dirichlet process prior, in which case the number of clusters can be directly inferred from the data, thus sidestepping the difficulty of pre-specifying the number of clusters in classical clustering approaches. Third, the proposed approach can handle correlated patient characteristics (models joint effects and the unit of inference is covariate profile) and correlations do not undermine model performance and interpretability. Finally, our method takes into account the uncertainty associated with the number of clusters and cluster assignments, and the uncertainty of the “representative” clustering can be quantified via model-averaging approaches (Molitor et al., 2010).
This paper is structured as follows. In Section 2, we introduce the statistical framework. The numerical performance of the method is evaluated by simulation studies in Section 3. An application of the proposed method to the INTERVAL data is presented in Section 4. We conclude the paper with a discussion in Section 5.
2 Methodology
We consider a clinical trial with treatments (randomized groups) and subjects. We let denote the treatment assignment, denote the covariate data, and denote the observed outcome of interest. The potential outcome under treatment is denoted by . Our proposed two-stage patient stratification method is detailed below in Sections 2.1 and 2.2. A graphical representation of the proposed approach is presented in Web Appendix A (Web Figure 1).
2.1 First stage: BART - predict potential outcomes
In the first stage, we fit a flexible regression model for given and in order to get the potential outcome estimates for subjects. We denote the predicted potential outcomes under treatments by , respectively. In theory, can be obtained from the observed data using any supervised machine learning or regression algorithms (Künzel et al., 2019). We use BART in our implementation (Hill, 2011).
We note that can also be obtained by fitting a separate model for under each treatment option (“separate-learner”) as opposed to fitting one single model for using the data from all subjects (“single-learner”) (Künzel et al., 2019). These two ways of predicting potential outcomes yield similar results in our numerical studies, and we only present the results based on the “single-learner” throughout this paper.
2.2 Second stage: Bayesian profile regression - clustering
Molitor et al. (2010) proposed a Bayesian mixture model called “profile regression”, which links the outcome vector to a set of possibly correlated covariates nonparametrically through cluster membership. Profile regression clusters subjects into subgroups by leveraging the Dirichlet process mixture model (DPMM). In profile regression, the covariates and the outcome are modeled jointly, and thus both influence cluster allocations. This is appealing, since covariate data typically give rise to many different clustering structures, and the inclusion of the target outcome information in clustering can be particularly helpful for ensuring that the clustering result has clinical utility with regard to our aim (Bair, 2013).
Our second stage clustering model follows the extension of profile regression to the case with a multivariate normal outcome (Rouanet et al., 2020). We use to denote the multivariate normal response variable in the profile regression model. The likelihood function is given by
[TABLE]
where is the index of mixture components, is a vector of mixture weights, denotes cluster-specific parameters in the density function for , and denotes cluster-specific parameters in the density function for . We note that by construction, DPMM allows infinite components. Therefore, the summation in (1) goes from to infinity.
2.2.1 Specification of model components
In the following, we describe the model for each component of (1), but we will not provide further details on the Bayesian computation aspect in this paper. We refer the interested readers to Liverani et al. (2015) and Rouanet et al. (2020) for computational details.
Mixture weights
The mixture weights () are modeled according to the stick-breaking construction of the Dirichlet process (DP) as follows (Sethuraman, 1994):
[TABLE]
where are independent random variables. Under this representation, is the concentration parameter of DP, which reflects the dispersion level and controls the number of non-empty clusters implicitly (Teh, 2010; Hastie et al., 2015; Frühwirth-Schnatter and Malsiner-Walli, 2019). We follow Rouanet et al. (2020) and adopt a Gamma prior for .
The model for covariates
The profile regression model can handle both continuous and discrete covariates. In the following, we consider the case where consists of continuous covariates and discrete covariates. In order to describe the complete covariate model, we assume that for now. The situations with or will be discussed later. We let and denote the subset of continuous and discrete covariates in , respectively. Without loss of generality, and for notational convenience, we assume that the first covariates in are continuous, i.e. (the covariate in ) is continuous for , and is discrete for .
We assume that and are independent conditional on the cluster assignments, and then the “density” for covariates can be written as:
[TABLE]
where represents the cluster-specific parameter set for covariates. In particular, and are the mean vector and the covariance matrix for continuous covariates in cluster , and denotes the parameter for discrete covariates in cluster . Note that we add “” in the subscripts of the parameters for continuous covariates (i.e. and ) in order to distinguish them from the parameters for the outcome model. If all the covariates in are continuous (i.e. =0), the right-hand side of Equation (2) will be replaced by . On the other hand, if all the covariates in are discrete (i.e. =0), the right-hand side of Equation (2) will be replaced by .
For , we assume the following probability density function:
[TABLE]
The conjugate normal-inverse-Wishart (NIW) prior is used for inference, i.e.
[TABLE]
where denotes the multivariate normal distribution, denotes the inverse-Wishart distribution, denotes the normal-inverse-Wishart distribution, and , , , and (matrix) are hyperparameters.
For discrete covariates, we assume that they are locally independent (independent conditional on the cluster assignments). Then the probability mass function of is given by
[TABLE]
where denotes the probability that covariate takes the value in cluster , , ( denotes the number of categories for covariate ). Following Liverani et al. (2015), we adopt the conjugate Dirichlet prior, i.e.
[TABLE]
where , .
The model for the outcome
We let denote the cluster-specific parameters for the multivariate normal outcome and we consider a multivariate normal (MVN) model for , i.e.
[TABLE]
Similar to , we specify the conjugate normal-inverse-Wishart prior for inference. Specifically,
[TABLE]
where , , and are hyperparameters. We note that in the profile regression model, we do not impose any specific covariance structure on , and is estimated in a data-driven way.
2.2.2 Post-processing of the clustering output
Identify the “representative” clustering
We note that the proposed Bayesian mixture modeling framework (stochastic) takes into account the uncertainty associated with the number of clusters and cluster assignments, and the clustering output can vary across iterations of the Markov chain Monte Carlo (MCMC) sampler. To facilitate the interpretation of the clustering results, we “summarize” the clustering output across iterations and identify a “representative” clustering structure. In each MCMC iteration, we can construct an score matrix, where the element equals to 1 if individuals and are allocated to the same cluster in this iteration, and equals to 0 if and are assigned to different clusters. We can then average the score matrices over all MCMC iterations to get a posterior similarity matrix, , which records the probability that two individuals are assigned to the same cluster (i.e. records the posterior co-clustering probabilities of all pairs of individuals). The “representative” clustering can be identified as the partition of the data that best represents . Specifically, we use the partitioning around medoids (PAM) algorithm (Kaufman and Rousseeuw, 1990): PAM is directly applied to the posterior dissimilarity matrix , which allocates individuals to clusters in a way consistent with . For each fixed number of clusters up to a prespecified maximum, we select the best PAM partition that minimizes the sum of dissimilarities between the center of each cluster and all other members of the same cluster. The final “representative” clustering is then chosen by maximizing the average silhouette width (Rousseeuw, 1987) across these best PAM partitions (Molitor et al., 2010).
Quantify the uncertainty associated with the “representative” clustering
We can evaluate how confident we are about the “representative” clustering by examining whether or not across different iterations the second-stage clustering model consistently clusters individuals in a way similar to the “representative” clustering, and we would expect the credible intervals associated with cluster parameter estimates to be narrower for more consistent clustering (stronger clustering signal). To this end, the model-averaging approach discussed in Molitor et al. (2010) can be employed under our proposed framework.
2.2.3 Variable selection in profile regression
Variable selection methods can be embedded into profile regression in order to identify the covariates that contribute significantly to the formation of clusters. In our implementation, we follow the variable selection approach taken by Liverani et al. (2015).
Continuous covariates
We let , where is a binary random variable that determines whether or not covariate , , is important for allocating subjects to cluster ( if the answer is yes and 0 otherwise). Let denote the average value of covariate (sample average), for , and we define , where
[TABLE]
where is the element of , . We then replace in (3) with . We assume that , . A sparsity inducing prior is used for (Papathomas et al., 2012; Liverani et al., 2015).
Discrete covariates
We can perform variable selection on discrete covariates in a similar manner. We let , where denotes a binary variable indicating whether or not covariate , , is important for assigning subjects to cluster . Let denote the probability that covariate takes the value in cluster , and be the observed proportion of covariate taking the value , , . To incorporate the variable selection feature, the discrete covariate model (4) is modified accordingly to
[TABLE]
Similar to the continuous covariate case, we assume that , and each is assigned a sparsity inducing prior, .
3 Simulation studies
In this section, we evaluate the performance of our proposed method via simulation studies.
3.1 Simulation design
We consider two simulation scenarios. The first scenario corresponds to the case where all covariates are continuous, and we evaluate the performance of the proposed method under different levels of correlations between covariates. In the second scenario, covariates include a mix of continuous and discrete ones and cluster sizes are unequal. We also extend this scenario to evaluate the performance of the variable selection procedure embedded in our method. We examine two sample sizes: and , and we repeat the simulation 100 times for each scenario. Details on the simulation design are given in the following.
3.1.1 Scenario 1
In this setting, we consider a three-arm trial () where treatment is sampled from {1, 2, 3} with equal probabilities. consists of 3 normally-distributed continuous covariates: , , and . There are 3 underlying clusters of equal size under this scenario. The mean values of covariates and expected potential outcomes under each treatment option for 3 clusters are summarized in Table 1 (a). We also plot the expected treatment response profiles of subjects in each cluster in Figure 1 (a) for better visualization. The observed outcome is simulated from a normal distribution with mean and variance . We consider different noise levels in covariates () and noise levels in the outcome (), which reflect different degrees of cluster separability. In addition, we vary the correlation between and (, 0.5, 0.8 conditional on the cluster assignment; other pairwise correlations are set as 0 conditional on the cluster assignment) to examine the influence of the degree of between-covariate correlations on our proposed method’s performance.
3.1.2 Scenario 2
In the second setting, we consider a four-arm trial () where treatment is sampled uniformly from {1, 2, 3, 4}. There exist 4 clusters, and the sizes of clusters 1-4 are , , , and , respectively.
We consider the situation where there are four covariates, all of which inform the clustering structure. In particular, and are binary variables that take on the value of 0 or 1, is a categorical variable taking on values 0, 1, or 2, and is a normally-distributed continuous variable. In Table 1 (b), we summarize the covariate profiles and expected treatment response profiles for 4 clusters. The expected treatment response profiles (by cluster) are also plotted in Figure 1 (b). The observed outcome follows a normal distribution with mean and variance . Four different combinations of noise levels in and noise levels in the outcome are considered, namely, , , , . We also consider a modification of scenario 2 to evaluate the performance of the proposed variable selection method. Details are provided in Web Appendix C.
3.2 Evaluation metrics
The performance of our proposed method in simulation studies is evaluated based on the following metrics:
Estimated number of clusters, . 2. 2.
Adjusted Rand index (ARI), which measures the similarity between the computed clustering structure and the ground truth (i.e. ideal clustering, we will refer to it as “class structure” to distinguish it from the computed clustering structure) (Hubert and Arabie, 1985). ARI equals to 1 if two partitions of the data are identical and a larger value of ARI indicates a higher level of agreement between two partitions. 3. 3.
Conditional entropy-based metrics (Rosenberg and Hirschberg, 2007):
- (a)
Homogeneity, which measures whether all members of a given computed cluster are from the same class (ground truth). Homogeneity score is larger (more desirable) when each cluster contains elements of fewer classes. 2. (b)
Completeness, which measures whether all members of a given class (ground truth) are in the same computed cluster. Completeness score is larger (more desirable) when the members of a given class are allocated to fewer clusters.
Both homogeneity and completeness are bounded between 0 and 1.
Details on how to compute ARI, homogeneity, and completeness are provided in Web Appendix B.
3.3 Simulation results
In each simulation replicate, BART is run for 6000 iterations to predict the potential outcomes and profile regression is run for 2000 iterations to cluster subjects into subgroups. For both BART and profile regression, the first 1000 iterations are discarded as burn-in. We use the default values specified in the BART package and the PReMiuMar package for all hyperparameters. Investigation of posterior samples suggests no evidence against convergence in our simulation studies.
3.3.1 Scenario 1
Simulation results for scenario 1 are presented in Table 2.
The results are fairly robust across different levels of correlations () among covariates. This is a highly desirable property of our proposed approach compared to many other standard approaches, whose performance can be very sensitive to the degree of correlations among covariates due to the multicollinearity problem (Molitor et al., 2010).
The column shows that our proposed method over-estimates the true number of clusters ( in this scenario), especially when the sample size is large (). This is not surprising given that the profile regression adopts a Dirichlet process mixture model (DPMM): it has been demonstrated that when the true number of clusters is finite and small, the posterior inference on the number of clusters by using the DPMM may be inconsistent, and DPMM tends to over-estimate the true number of clusters and produce some small extraneous clusters around the true components (Onogi et al., 2011; Miller and Harrison, 2013; Yang et al., 2019; Lu et al., 2018). This phenomenon is referred to as “over-clustering” by Lu et al. (2018), which might be due to the sensitivity of DPMM to even minor deviations that exist among clusters. As has been discussed in Section 2.2.1, the choice of the concentration parameter implicitly affects the expected number of (non-empty) clusters. Results presented in Table 2 are obtained by assuming a Gamma(2,1) prior for (i.e. the default in the PReMiuMar package; ). We also examine the empirical results obtained under other commonly-used priors for , for example, Gamma(2,4) () suggested by Escobar and West (1995), and the prior that is matched to the sparse finite mixture model (Frühwirth-Schnatter and Malsiner-Walli, 2019). The estimated number of clusters, , are very similar across different prior specifications for (not presented in the paper), even though based on theoretical considerations, the results might differ (Frühwirth-Schnatter and Malsiner-Walli, 2019).
Not surprisingly, homogeneity is higher (better) than completeness in most cases, given that is typically over-estimated by our method. In this sense, it is more likely that subjects who are clustered together are from the same class (the ground truth in simulation studies), and thus leading to “more homogeneous” clusters. On the other hand, small extraneous clusters (centered around true clusters) that are produced by DPMM may explain the “less complete” clustering results.
As expected, higher noise levels in covariates () and outcomes () result in worse clustering performance (i.e. lower ARI, completeness, and homogeneity, and a larger upward bias in the estimation of ), given that we would expect the underlying clustering structure to be less clear (lower cluster separability) with larger values of and .
A comparison of the results obtained when with those obtained when suggests that larger sample sizes do not seem to improve the performance of our method in this setting. When the noises in covariates and the outcome are large (), ARI and completeness even get worse as the sample size increases. One possible reason for this observation is that the over-estimation of by DPMM is more pronounced when , and this has a subsequent (negative) effect on ARI and completeness.
In addition to measuring the overall clustering accuracy of our proposed method based on the clustering performance metrics described in Section 3.2, we also examine how well our method performs in terms of recovering the true underlying cluster-specific parameters (e.g. means). To this end, in each simulation replicate, we first obtain the posterior summaries of mean parameters for , , , , , and in each resulting cluster, and these cluster-specific parameters are then re-weighted by the cluster sizes. We note that the re-weighting step is important: posterior summary statistics for larger clusters should be assigned more weight since larger clusters carry more information. The densities of the re-weighted results over 100 simulation replicates in low () and high () noise settings are plotted in Figure 2 ((a) for covariates , , and and (b) for potential outcomes under each treatment option). These plots correspond to the case with and . Density plots corresponding to other and values look similar (do not alter our conclusions) and are thus omitted.
Figure 2 implies that the true cluster-specific means (dashed lines) for all covariates and potential outcomes can be recovered by our proposed method, even though the estimated number of clusters is greater than the truth. When the noise level is high (), the densities are flatter (nosier) compared to the case with low noise levels (), but our method still manages to recover the truth.
3.3.2 Scenario 2
Simulation results under scenario 2 are summarized in Table 3.
As in the first scenario, the homogeneity score is almost always higher than the completeness score due to “over-clustering”. When the noise level is low (), the clusters are well-separated and our proposed method performs almost perfectly in terms of ARI, completeness and homogeneity, despite the over-estimation of remaining a problem (for this scenario, ). As the noise level increases, the clustering performance gets worse, and it seems to be more sensitive to the change in than to the change in (when we compare the results corresponding to with those corresponding to ). We also observe that first increases and then decreases as the noise gets larger, possibly because our proposed algorithm merges some clusters when the true underlying clustering structure has very low separability (i.e. very large noise).
4 Application to the INTERVAL trial
We apply our proposed patient stratification approach to the data from the INTERVAL trial. The purpose of this analysis is to uncover donor subgroups with different baseline characteristics and potential “treatment” (inter-donation interval in the blood donation context) response profiles. Specifically, we focus our analysis on a “much-in-demand but vulnerable” donor population of 884 female donors who were younger than 40 (more at risk of iron deficiency after donating blood) and had O negative blood type (the “universal” blood group that can be transfused to any patient in need and used in medical emergencies). The three randomized groups for female donors are 16-week, 14-week, and 12-week inter-donation intervals. The target outcome of our interest is a utility score (denoted by ) that accounts for the trade-off between the total units of blood collected per donor over the 2-year trial period (the benefit outcome, denoted by ) and the number of low hemoglobin (Hb) deferrals per donor during the same period (the risk outcome, denoted by ), i.e. , where is the trade-off parameter reflecting the equivalent benefit loss for one unit increase in the risk. We examine the case with to reflect the potential costs of reduced efficiency of blood collection and reduced donor retention due to low Hb deferrals. Seven baseline donor characteristics are considered, including the Short Form Health Survey version 2 (SF-36v2) physical component score (PCS), mental component score (MCS), ferritin level, red blood cell count (RBC), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and body mass index (BMI).
In our analysis, we use the default setups that are specified in the BART package and the PReMiuMar package for priors and hyperparameters. We run BART for 6000 MCMC iterations with an initial burn-in of 1000 iterations to predict potential outcomes, and run profile regression for 40000 iterations with a burn-in of 10000 iterations in the clustering step. We do not find strong evidence against convergence. More details on convergence diagnostics are given in Web Appendix D.
Unlike in the simulation studies where we can evaluate the performance of our method by calculating external metrics such as ARI, homogeneity, and completeness, in the real data application, the ground truth is not known and thus these external validation metrics cannot be used. Instead, we focus on the interpretation of the “representative” clustering of 884 female donors in order to assess whether or not our proposed approach can give insights into donor heterogeneity and stratify donors into clinically meaningful subgroups.
Inspection of the raw output from the profile regression model (before applying the post-processing method described in Section 2.2.2) suggests that at each MCMC iteration, there are either 6 or 7 resulting non-empty clusters, with 5 moderately-sized clusters and 1 or 2 very small clusters. In particular, for all MCMC iterations, one of the resulting clusters contains only 1 donor whose “red blood cell-related” blood measurements are fairly extreme (lowest RBC, highest MCH and highest MCV among all 884 female donors). The other small cluster (if it exists, i.e. when the total number of non-empty clusters is 7) includes donors with extremely low PCS (the size of this cluster varies slightly across different iterations, but it is always less than 8).
We apply the post-processing method discussed in Section 2.2.2 to identify the “representative” clustering based on posterior samples. The heatmap of the posterior similarity matrix for 884 female donors is presented in Web Appendix E (Web Figure 4). The “representative” clustering consists of 5 donor subgroups, and the sizes of these subgroups are 171, 126, 101, 93, and 393, respectively. This indicates that the 1 or 2 very small cluster(s) in the raw output are merged into larger clusters in this post-processing step.
Figure 3 shows the covariate profiles (posterior distributions of the mean parameters for donors’ baseline characteristics) and the potential outcome profiles (posterior distributions of the mean parameters for potential outcomes under 16-, 14-, and 12-week inter-donation intervals) corresponding to each of the five donor subgroups. These results reveal considerable evidence for the presence of heterogeneity within the donor population under investigation.
Cluster 4 (with 126 donors) represents the potentially “super donor” subgroup who can give blood more frequently than the current clinical practice and for these donors, more frequent donations lead to larger utilities (Figure 3 (b)). Comparing across subgroups, this group of donors has higher donation capacity since the utility scores are on average higher, especially under the 14-week and 12-week inter-donation intervals. These “super donors” are characterized by having high levels of ferritin, RBC, BMI, MCS, MCV and MCH and low levels of PCS. In particular, the most distinguishing feature that differentiates this subgroup from the remaining subgroups is the ferritin level (significantly higher in this subgroup). This is consistent with a priori expectation that donors with higher ferritin levels are in general more capable of donating blood more often (especially for female donors).
Cluster 2 is the largest subgroup with 393 donors. Figure 3 (b) suggests that donors in this subgroup have the potential to donate blood every 12 weeks, even though the gain in the utility score by switching from the 16-week inter-donation interval to the 12-week inter-donation interval is only moderate and not as significant compared to cluster 4. Consistent with cluster 4, donors in cluster 2 on average have high MCS, MCV, and MCH. However, their ferritin levels, RBC and BMI are lower than those of donors in cluster 4 (Figure 3 (a)).
Cluster 1 (with 101 donors) represents a potentially “brittle donor” subgroup. Utility score gets smaller as the inter-donation interval gets shorter (Figure 3 (b)). The low ferritin levels, RBC and MCS of these donors (Figure 3 (a)) may potentially explain why they are vulnerable. As an aside, the cluster with only one donor in the raw output before post-processing is merged into this “brittle donor” subgroup. This is sensible since blood-based measurements of this donor suggest that she may be “vulnerable”.
Cluster 5 (with 171 donors) also represents a subgroup of donors with low donation capacity. The utility scores of donors in this subgroup are similar under three inter-donation interval options and they are lower than the utility scores of donors in clusters 2, 3, and 4 (Figure 3 (b)). Donors in cluster 5 (on average) have a high MCS similar to that of donors in cluster 4. However, low values of donation capacity-related characteristics such as ferritin levels, MCV, MCH, and BMI (Figure 3 (a)) may lead to higher than average deferral rates and lower than average utility scores in this subgroup.
The utility scores of donors in cluster 3 (with 93 donors) are almost identical under 16-week and 14-week inter-donation intervals but decrease significantly when the donation frequency is every 12 weeks (Figure 3 (b)). The ferritin levels, MCV, MCH, and BMI are relatively high (Figure 3 (a)) for these donors, and the drop in the utility score when the inter-donation interval is 12-week may be attributed to the low MCS.
To summarize, our analysis of the INTERVAL data using the proposed approach reveals clinically meaningful donor subgroups within the “much-in-demand but vulnerable” donor population under investigation.
5 Discussion
The uncovering of subgroups from a heterogeneous population plays an important role in precision medicine applications. In this paper, we present a two-stage patient stratification approach that leverages Bayesian nonparametric techniques. Our proposed method captures the heterogeneity in the underlying population and clusters the population into subgroups of subjects who share similar covariate profiles and display similar treatment responses. Specifically, in the first stage, we predict the potential outcome under each treatment arm for each patient, and in the second stage, we apply profile regression to link the multivariate potential outcome vector to a set of covariates (can be continuous, discrete, or a mix of continuous and discrete ones) through cluster membership (Molitor et al., 2010). Based on the posterior samples, the resulting subgroups can be characterized in terms of covariate and treatment response profiles.
Our method offers several advantages. Firstly, while most existing methods for subgroup identification only cover two-treatment cases, our proposed approach is applicable to multi-arm trials. Secondly, the use of the Dirichlet process prior allows the number of clusters to be estimated from the data, thus bypassing the need for pre-specifying it. Thirdly, our method can properly handle correlated covariates (avoiding well-known problems caused by multicollinearity), which are common in clinical studies. Fourthly, a variable selection procedure is embedded into our model for identifying important covariates that actively “drive” the clustering structure (i.e. contribute significantly to the cluster patterns). Lastly, the proposed approach is built under the Bayesian framework and takes into account model uncertainties (Molitor et al., 2010).
The application of our proposed method to a subset of the INTERVAL data (a “much-in-demand but vulnerable” donor population) identifies 5 clinically meaningful donor subgroups with different donation capacities and covariate (donors’ baseline characteristics) profiles. These results provide insight into the underlying donor heterogeneity by highlighting the differences between donors in terms of both baseline characteristics and potential response (to three different inter-donation intervals) patterns, and can be leveraged to inform and guide targeted donor recruitment and donor management strategy. For example, donors who are identified as belonging to the “super donor” subgroup may be asked to give blood more frequently if there is a blood shortage or if their blood group is rare or universal. In contrast, donors who belong to the “brittle donor” subgroup will be allowed longer time between donations to ensure donor health and safety (BTRU, 2019).
The use of the Dirichlet process prior in the second-stage profile regression model allows the number of clusters to be discovered in a data-driven way. However, as demonstrated by our simulation studies, Dirichlet process mixture model (DPMM) tends to over-estimate the number of clusters (produce some superfluous small-sized clusters). Indeed, the inconsistent inference (over-estimation) on the number of components by DPMM is a well-known problem (referred to as “over-clustering” in Lu et al. (2018)) that has been empirically observed and reported in the literature before, and it appears that in most cases, the extra clusters are centered around the true components and only include a very small number of individuals (Onogi et al., 2011; Miller and Harrison, 2013). To our knowledge, how to correct for such inconsistency remains an open question in the field of Bayesian mixture modeling (Miller and Harrison, 2013; Yang et al., 2019). In general, we think that a slight over-estimation of the true number of clusters is not an issue of major concern in the context of patient stratification, since in this case, the primary interest typically lies in the characterization and the interpretation of clusters rather than the inference on the exact number of underlying clusters. As has been noted by Onogi et al. (2011), even though the extra clusters produced by DPMM are considered as redundant and interpreted as over-estimation in simulation studies, they may provide useful information in real data applications since they reflect some relatively subtle heterogeneity that might be clinically interesting. In practice, in order to achieve better interpretability of the clustering results, the model-based “representative” clustering should be coupled with practitioners’ subject-matter knowledge when determining the optimal number of clusters. If the inspection of the output indicates that the small clusters produced by DPMM are not of much clinical interest because they do not reflect a general pattern, the other larger and more representative clusters will be given more emphasis in the interpretation. In addition, if the profiles of some small clusters are very similar to those of some much larger clusters in terms of clinical meaningfulness, we can merge them a posteriori. From our point of view, this would be preferable to a method of low granularity, in which case some useful information on the population heterogeneity may be overlooked.
The work in this paper has raised new research questions that are worthy of further investigation. Although our method is primarily developed for the setting with a univariate and continuous target outcome, it can be extended to the multi-outcome setting.
Since we use the posterior mean of the potential outcome predictions (from BART) rather than the full posterior samples as the outcome in the profile regression model, the uncertainties associated with the predicted potential outcomes (the first stage) are not carried forward to the second stage and thus not reflected in the final output. We may employ the idea of Markov melding (a generic Bayesian computational method for evidence synthesis) to allow uncertainty propagation (Goudie et al., 2019).
Numerical experiments suggest that our proposed patient stratification method does not scale well to large datasets, mainly due to the computational inefficiency of the current implementation of profile regression with a multivariate normal outcome in the PReMiuMar package. MCMC sampling can be prohibitively slow, and a variational inference algorithm for Dirichlet process mixture models may be considered to speed up the computation (Blei and Jordan, 2006).
In the profile regression model, both covariates () and outcome () inform the clustering structure. If the dimension of is much higher than that of , the contribution of to the likelihood is likely to be overwhelmed by that of (i.e. covariates might dominate the likelihood and the relative contribution of the outcome may be undermined). Consequently, the impact of the response data on the cluster allocation will be small and the resulting clusters will be formed mainly based on the similarity in the covariate space. Bigelow and Dunson (2009) argued that this might be a desirable property for some epidemiological studies. However, in some other cases where we expect the outcome to play a more important role in the clustering, we might need to upweight the outcome likelihood. In practice, the weight may be subjective and may depend heavily on the contexts and research aims.
Acknowledgments
This work was supported by the UK Medical Research Council programme MC_UU_00002/2 and the Cambridge International Scholarship. Participants in the INTERVAL trial were recruited with the active collaboration of NHS Blood and Transplant England (www.nhsbt.nhs.uk), which has supported field work and other elements of the trial. The academic coordinating centre for INTERVAL was supported by core funding from: NIHR Blood and Transplant Research Unit in Donor Health and Genomics (NIHR BTRU-2014-10024), UK Medical Research Council (MR/L003120/1), British Heart Foundation (SP/09/002; RG/13/13/30194; RG/18/13/33946) and the NIHR [Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust]. A complete list of the investigators and contributors to the INTERVAL trial is provided in Di Angelantonio et al. (2017). The academic coordinating centre would like to thank blood donor centre staffs and blood donors for participating in the INTERVAL trial. This work was also supported by Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome. The views expressed in this paper are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Supporting Information
Additional supporting information may be found online in the Supporting Information section at the end of the article. The R code for implementing the proposed patient stratification method is available at https://github.com/yx299/stratification.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bair (2013) Bair, E. (2013). Semi-supervised clustering methods. Wiley Interdisciplinary Reviews: Computational Statistics 5, 349–361.
- 2Baron et al. (2013) Baron, G., Perrodeau, E., Boutron, I., and Ravaud, P. (2013). Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC Medicine 11, Article 84.
- 3Bigelow and Dunson (2009) Bigelow, J. L. and Dunson, D. B. (2009). Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association 104, 26–36.
- 4Blei and Jordan (2006) Blei, D. M. and Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis 1, 121–143.
- 5BTRU (2019) BTRU (2019). Understanding donor characteristics.
- 6Di Angelantonio et al. (2017) Di Angelantonio, E., Thompson, S. G., Kaptoge, S., Moore, C., Walker, M., Armitage, J., et al. (2017). Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45000 donors. Lancet 390, 2360–2371.
- 7Escobar and West (1995) Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
- 8Frühwirth-Schnatter and Malsiner-Walli (2019) Frühwirth-Schnatter, S. and Malsiner-Walli, G. (2019). From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering. Advances in Data Analysis and Classification 13, 33–64.
