Recovering Sparse and Interpretable Subgroups with Heterogeneous Treatment Effects with Censored Time-to-Event Outcomes
Chirag Nagpal, Vedant Sanil, Artur Dubrawski

TL;DR
This paper introduces a new statistical method for identifying sparse, interpretable subgroups with different treatment effects in censored time-to-event data, enhancing personalized treatment analysis in clinical studies.
Contribution
It proposes a mixture model with structured sparsity regularization and a novel inference procedure for recovering phenogroups with differential effects in censored survival data.
Findings
Successfully recovers sparse phenotypes in cardiovascular studies
Demonstrates effectiveness on large real-world clinical datasets
Improves understanding of subgroup-specific treatment effects
Abstract
Studies involving both randomized experiments as well as observational data typically involve time-to-event outcomes such as time-to-failure, death or onset of an adverse condition. Such outcomes are typically subject to censoring due to loss of follow-up and established statistical practice involves comparing treatment efficacy in terms of hazard ratios between the treated and control groups. In this paper we propose a statistical approach to recovering sparse phenogroups (or subtypes) that demonstrate differential treatment effects as compared to the study population. Our approach involves modelling the data as a mixture while enforcing parameter shrinkage through structured sparsity regularization. We propose a novel inference procedure for the proposed model and demonstrate its efficacy in recovering sparse phenotypes across large landmark real world clinical studies in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Causal Inference Techniques · Statistical Methods and Inference · Statistical Methods in Clinical Trials
\clearauthor\Name
Chirag Nagpal \[email protected]
\NameVedant Sanil \[email protected]
\NameArtur Dubrawski \[email protected]
\addrAuton Lab, Carnegie Mellon
Recovering Sparse and Interpretable Subgroups with Heterogeneous Treatment Effects with Censored Time-to-Event Outcomes
Abstract
Studies involving both randomized experiments as well as observational data typically involve time-to-event outcomes such as time-to-failure, death or onset of an adverse condition. Such outcomes are typically subject to censoring due to loss of follow-up and established statistical practice involves comparing treatment efficacy in terms of hazard ratios between the treated and control groups. In this paper we propose a statistical approach to recovering sparse phenogroups (or subtypes) that demonstrate differential treatment effects as compared to the study population. Our approach involves modelling the data as a mixture while enforcing parameter shrinkage through structured sparsity regularization. We propose a novel inference procedure for the proposed model and demonstrate its efficacy in recovering sparse phenotypes across large landmark real world clinical studies in cardiovascular health.
keywords:
Time-to-Event, Survival Analysis, Heterogeneous Treatment Effects, Hazard Ratio
1 Introduction
Data driven decision making across multiple disciplines including healthcare, epidemiology, econometrics and prognostics often involves establishing efficacy of an intervention when outcomes are measured in terms of the time to an adverse event, such as death, failure or onset of a critical condition. Typically the analysis of such studies involves assigning a patient population to two or more different treatment arms often called the ‘treated’ (or ‘exposed’) group and the ‘control’ (or ‘placebo’) group and observing whether the populations experience an adverse event (for instance death or onset of a disease) over the study period at a rate that is higher (or lower) than for the control group. Efficacy of a treatment is thus established by comparing the relative difference in the rate of event incidence between the two arms called the hazard ratio. However, not all individuals benefit equally from an intervention. Thus, very often potentially beneficial interventions are discarded even though there might exist individuals who benefit, as the population level estimates of treatment efficacy are inconclusive.
In this paper we assume that patient responses to an intervention are typically heterogeneous and there exists patient subgroups that are unaffected by (or worse, harmed) by the intervention being assessed. The ability to discover or phenotype these patients is thus clinically useful as it would allow for more precise clinical decision making by identifying individuals that actually benefit from the intervention being assessed.
Towards this end, we propose Sparse Cox Subgrouping, (SCS) a latent variable approach to model patient subgroups that demonstrate heterogeneous effects to an intervention. As opposed to existing literature in modelling heterogeneous treatment effects with censored time-to-event outcomes our approach involves structured regularization of the covariates that assign individuals to subgroups leading to parsimonious models resulting in phenogroups that are interpretable. We release a python implementation of the proposed SCS approach as part of the auton-survival package (Nagpal et al., 2022b) for survival analysis:
https://autonlab.github.io/auton-survival/
2 Related Work
Large studies especially in clinical medicine and epidemiology involve outcomes that are time-to-events such as death, or an adverse clinical condition like stroke or cancer. Treatment efficacy is typically estimated by comparing event rates between the treated and control arms using the Proportional Hazards (Cox, 1972) model and its extensions.
Identification of subgroups in such scenarios has been the subject of a large amount of traditional statistical literature. Large number of such approaches involve estimation of the factual and counterfactual outcomes using separate regression models (T-learner) followed by regressing the difference between these estimated potential outcomes. Within this category of approaches, Lipkovich et al. (2011) propose the subgroup identification based on differential effect search (SIDES) algorithm, Su et al. (2009) propose a recursive partitioning method for subgroup discovery, Dusseldorp and Mechelen (2014) propose the qualitative interaction trees (QUINT) algorithm, and Foster et al. (2011) propose the virtual twins (VT) method for subgroup discovery involving decision tree ensembles. We include a parametric version of such an approach as a competing baseline.
Identification of heterogeneous treatment effects (HTE) is also of growing interest to the machine learning community with multiple approaches involving deep neural networks with balanced representations (Shalit et al., 2017; Johansson et al., 2020), generative models Louizos et al. (2017) as well as Non-Parametric methods involving random-forests (Wager and Athey, 2018) and Gaussian Processes (Alaa and Van Der Schaar, 2017). There is a growing interest in estimating HTEs from an interpretable and trustworthy standpoint (Lee et al., 2020; Nagpal et al., 2020; Morucci et al., 2020; Wu et al., 2022; Crabbé et al., 2022). Wang and Rudin (2022) propose a sampling based approach to discovering interpretable rule sets demonstrating HTEs.
However large part of this work has focused extensively on outcomes that are binary or continuous. The estimation of HTEs in the presence of censored time-to-events has been limited. Xu et al. (2022) explore the problem and describe standard approaches to estimate treatment effect heterogeneity with survival outcomes. They also describe challenges associated with existing risk models when assessing treatment effect heterogeneity in the case of cardiovascular health.
There has been some initial attempt to use neural network for causal inference with censored time-to-event outcomes. Curth et al. (2021) propose a discrete time method along with regularization to match the treated and control representations. Chapfuwa et al. (2021)’s approach is related and involves the use of normalizing flows to estimate the potential time-to-event distributions under treatment and control. While our contributions are similar to Nagpal et al. (2022a), in that we assume treatment effect heterogeneity through a latent variable model, our contribution differs in that 1) Our approach is free of the expensive Monte-Carlo sampling procedure and 2) Our generalized EM inference procedure allows us to naturally incorporate structured sparsity regularization, which helps recovers phenogroups that are parsimonious in the features they recover that define subgroups.
Survival and time-to-event outcomes occur pre-eminently in areas of cardiovascular health. One such area is reducing combined risk of adverse outcomes from atherosclerotic disease111A class of related clinical conditions from increasing deposits of plaque in the arteries, leading to Stroke, Myorcardial Infarction and other Coronary Heart Diseases. (Herrington et al., 2016; Furberg et al., 2002; Group, 2009; Buse et al., 2007) The ability of recovering groups with differential benefits to interventions can thus lead to improved patient care through framing of optimal clinical guidelines.
3 Proposed Model: Sparse Cox Subgrouping
Notation
As is standard in survival analysis, we assume that we either observe the true time-to-event or the time of censoring indicated by the censoring indicator defined as . We thus work with a dataset of right censored observations in the form of 4-tuples, , where is the time-to-event or censoring as indicated by , is the indicator of treatment assignment, and are individual covariates that confound the treatment assignment and the outcome.
Assumption 1** (Independent Censoring)**
The time-to-event and the censoring distribution are independent conditional on the covariates and the intervention .
Model
Consider a maximum likelihood approach to model the data the set of parameters . Under Assumption 1 the likelihood of the data can be given as,
[TABLE]
here is the hazard and is the survival rate.
Assumption 2** (PH)**
The distribution of the time-to-event conditional on the covariates and the treatment assignment obeys proportional hazards.
From Assumption 2 (Proportional Hazards), an individual with covariates under intervention under a Cox model with parameters and treatment effect is given as
[TABLE]
Here, is an infinite dimensional parameter known as the base survival rate. In practice in the Cox’s model the base survival rate is a nuisance parameter and is estimated non-parametrically. In order to model the heterogeneity of treatment response. We will now introduce a latent variable that mediates treatment response to the model,
[TABLE]
Here, is the treatment effect, and is the set of parameters that mediate assignment to the latent group conditioned on the confounding features . Note that the above choice of parameterization naturally enforces the requirements from the model as in Figure 1. Consider the following scenarios,
Case 1: The study population consists of two sub-strata ie. , that are benefit and are unaffected by treatment respectively.
Case 2: The study population consists of three sub-strata ie. , that benefit, are harmed or unaffected by treatment respectively.
Following from Equations 1 & 3, the complete likelihood of the data under this model is,
[TABLE]
Note that is the infinite dimensional cumulative hazard and is inferred when learning the model. We will notate the set of all learnable parameters as .
Shrinkage
In retrospective analysis to recover treatment effect heterogeneity a natural requirement is parsimony of the recovered subgroups in terms of the covariates to promote model interpretability. Such parsimony can be naturally enforced through appropriate shrinkage on the coefficients that promote sparsity. We want to recover phenogroups that are ‘sparse’ in . We enforce sparsity in the parameters of the latent gating function via a group (Lasso) penalty. The final loss function to be optimized including the group sparsity regularization term is,
[TABLE]
Identifiability
Further, to ensure identifiability we restrict the gating parameters for the to be [math]. Thus .
Inference
We will present a variation of the Expectation Maximization algorithm to infer the parameters in Equation 3. Our approach differs from Nagpal et al. (2022a, 2021) in that it does not require storchastic Monte-Carlo sampling. Further, our generalized EM inference allows for incorporation of the structured sparsity in the M-Step.
A Semi-Parametric
Note that the likelihood in Equation 3 is semi-parametric and consists of parametric components and the infinite dimensional base hazard . We define the as:
[TABLE]
The E-Step
Requires computation of the posteriors counts
Result 1** (Posterior Counts)**
The posterior counts for the latent are estimated as,
[TABLE]
For a full discussion on derivation of the and the posterior counts please refer to Appendix B
The M-Step
Involves maximizing the function. Rewriting the as a sum of two terms,
[TABLE]
Result 2** (Weighted Cox model)**
The term can be rewritten as a weighted Cox model and thus optimized using the corresponding ‘partial likelihood’,
Updates for : The partial-likelihood, under sampling weights (Binder, 1992) is
[TABLE]
Here is the ‘risk set’ or the set of all individuals who haven’t experienced the event till the corresponding time, i.e. . is convex in and we update these with a gradient step.
Updates for : The base hazard are updated using a weighted Breslow’s estimate (Breslow, 1972; Lin, 2007) assuming the posterior counts to be sampling weights (Chen, 2009),
[TABLE]
Term is a function of the gating parameters that determine the latent assignment along with sparsity regularization. We update using a Proximal Gradient update as is the case with Iterative Soft Thresholding (ISTA) for group sparse regression.
Updates for : The proximal update for including the group regularization (Friedman et al., 2010) term is,
[TABLE]
All together the inference procedure is described in Algorithm 1.
{algorithm}
[!h] ** Parameter Learning for SCS with a Generalized EM** \SetAlgoLined\SetKwInOutInputInput \SetKwInOutreturnReturn \InputTraining set, ; maximum EM iterations, , step size
\While
<not converged> \For$b\in\{1,2,...,B\}$ **E-Step**Compute posterior counts (Equation 6).
M-Step
Gradient descent update.
Breslow (1972)’s estimator.
Update with gradient of .
Proximal update.
\returnlearnt parameters ;
4 Experiments
In this section we describe the experiments conducted to benchmark the performance of SCS against alternative models for heterogenous treatment effect estimation on multiple studies including a synthetic dataset and multiple large landmark clinical trials for cardiovascular diseases.
4.1 Simulation
In this section we first describe the performance of the proposed Sparse Cox Subgrouping approach on a synthetic dataset designed to demonstrate heterogeneous treatment effects. We randomly assign individuals to the treated or control group. The latent variable is drawn from a uniform categorical distribution that determines the subgroup,
[TABLE]
Conditioned on we sample as in Figure 2 that determine the conditional Hazard Ratios , and randomly sample noisy covariates . The true time-to-event and censoring times are then sampled as,
[TABLE]
Finally we sample the censoring indicator and set the observed time-to-event,
[TABLE]
Figure 2 presents the ROC curves for SCS’s ability to identify the groups with enhanced and diminished treatment effects respectively. In Figure 3 we present Kaplan-Meier estimators of the Time-to-Event distributions conditioned on the predicted by SCS. Clearly, SCS is able to identify the phenogroups corresponding to differential benefits.
4.2 Recovering subgroups demonstrating Heterogeneous Treatment Effects from Landmark studies of Cardiovascular Health
Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack
(Furberg et al., 2002)
The ALLHAT study was a large randomized experiment conducted to assess the efficacy of multiple classes of blood pressure lowering medicines for patients with hypertension in reducing risk of adverse cardiovascular conditions. We considered a subset of patients from the original ALLHAT study who were randomized to receive either Amlodipine (a calcium channel blocker) or Lisinopril (an Angiotensin-converting enzyme inhibitor). Overall, Amlodipine was found to be more efficacious than Lisinopril in reducing combined risk of cardio-vascular disease.
Bypass Angioplasty Revascularization Investigation in Type II Diabetes
(Group, 2009)
Diabetic patients have been traditionally known to be at higher risk of cardiovascular disease however appropriate intervention for diabetics with ischemic heart disease between surgical coronary revascularization or management with medical therapy is widely debated. The BARI2D was a large landmark experiment conducted to assess efficacy between these two possible medical interventions. Overall BARI2D was inconclusive in establishing the appropriate therapy between Coronary Revascularization or medical management for patients with Type-II Diabetes.
Figure 4 presents the event-free survival rates as well as the summary statistics for the studies. In our experiments, we included a large set of confounders collected at baseline visit of the patients which we utilize to train the proposed model. A full list of these features are in Appendix A.
4.3 Baselines
Cox PH with Regularized Treatment Interaction (cox-int)
a
We include treatment effect heterogeneity via interaction terms that model the time-to-event distribution using a proportional hazards model as in Kehl and Ulm (2006). Thus,
[TABLE]
The interaction effects are regularized with a lasso penalty in order to recover a sparse phenotyping rule defined as .
Binary Classifier with Regularized Treatment Interaction (bin-int)
a
Instead of modelling the time-to-event distribution we directly model the thresholded survival outcomes at a five year time horizon using a log-linear parameterization with a logit link function. As compared to cox-int, this model ignores the data-points that were right-censored prior to the thresholded time-to-event, however it is not sensitive to the strong assumption of Proportional Hazards.
[TABLE]
Cox PH T-Learner with Regularized Logistic Regression (cox-tlr)
a
We train two separate Cox Regression models on the treated and control arms (T-Learner) to estimate the potential outcomes under treatment and control . Motivated from the ‘Virtual Twins’ approach as in Foster et al. (2011), a logistic regression with an penalty is trained to estimate if the risk of the potential outcome under treatment is higher than under control. This logistic regression is then employed as the phenotyping function and is given as,
[TABLE]
The above models involving sparse regularization were trained with the glmnet (Friedman et al., 2009) package in R.
The ACC/AHA Long term Atheroscleoratic Cardiovascular Risk Estimate
222https://tools.acc.org/ascvd-risk-estimator-plus/
a
The American College of Cardiology and the American Heart Association model for estimation of risk of Atheroscleratic disease risk (Goff Jr et al., 2014) involves pooling data from multiple observational cohorts of patients followed by modelling the 10-year risk of an adverse cardiovascular condition including death from coronary heart disease, Non-Fatal Myocardial Infarction or Non-fatal Stroke. While the risk model was originally developed to assess factual risk in the observational sense, in practice it is also employed to assess risk when making counterfactual decisions.
4.4 Results and Discussion
Protocol
We compare the performance of SCS and the corresponding competing methods in recovery of subgroups with enhanced (or diminished treatment effects). For each of these studies we stratify the study population into equal sized sets for training and validation while persevering the proportion of individuals that were assigned to treatment and experienced the outcome in the follow up period. The models were trained on the training set and validated on the held-out test set. For each of the methods we experiment with models that do not enforce any sparsity () as well as tune the level of sparsity to recover phenotyping functions that involve and features. The subgroup size are varied by controlling the threshold at which the individual is assigned to a group. Finally, the treatment effect is compared in terms of Hazard Ratios, Risk Differences as well as Restricted Mean Survival Time over a 5 Year event period.
Results
We present the results of SCS versus the baselines in terms of Hazard Ratios on the ALLHAT and BARI2D datasets in Figures 5 and 6. In the case of ALLHAT, SCS consistently recovered phenogroups with more pronounced (or diminished) treatment effects. On external validation on the heldout dataset, we found a subgroup of patients that had similar outcomes whether assigned to Lisinopril or Amlodipine, whereas the other subgroup clearly identified patients that were harmed with Lisinopril. The group harmed with Lisinopril had higher Diastolic BP. On the other hand, patients with Lower kidney function did not seem to benefit from Amlodipine.
In the case of BARI2D, SCS recovered phenogroups that were both harmed as well as benefitted from just medical therapy without revascularization. The patients who were harmed from Medical therapy were typically older, on the other hand the patients who benefitted primarily included patients who were otherwise assigned to receive PCI instead of CABG revascularization, suggesting PCI to be harmful for diabetic patients.
Tables 3 and 4 present the features that were selected by the proposed model for the studies. Additionally, we also report tabulated results involving metrics like risk difference and restricted mean survival time in the Appendix C.
5 Concluding Remarks
We presented Sparse Cox Subgrouping (SCS) a latent variable approach to recover subgroups of patients that respond differentially to an intervention in the presence of censored time-to-event outcomes. As compared to alternative approaches to learning parsimonious hypotheses in such settings, our proposed model recovered hypotheses with more pronounced treatment effects which we validated on multiple studies for cardiovascular health.
While powerful in its ability to recover parsimonious subgroups there exists limitations in SCS in its current form. The model is sensitive to proportional hazards and may be ill-specified when the proportional hazards assumptions are violated as is evident in many real world clinical studies (Maron et al., 2018; Bretthauer et al., 2022). Another limitation is that SCS in its current form looks at only a single endpoint (typically death, or a composite of multiple adverse outcome). In practice however real world studies typically involve multiple end-points. We envision that extensions of SCS would allow patient subgrouping across multiple endpoints, leading to discovery of actionable sub-populations that similarly benefit from the intervention under assessment.
Appendix A Additional Details on the ALLHAT and BARI 2D Case Studies
Tables 1 and 2 represent additional confounding variables found in the ALLHAT and BARI2D trials respectively.
[TABLE]
[TABLE]
Appendix B Derivation of the Inference Algorithm
Censored Instances: Note that in the case of the censored instances we will condition on the thresholded survival . The the posterior counts thus reduce to:
[TABLE]
[TABLE]
Uncensored Instances The posteriors are ,
Posteriors for the uncensored data are more involved and involve the base hazard . Posteriors for uncensored data are independent of the base hazard function, as,
[TABLE]
Combining Equations 13 and B we arrive at the following estimate for the posterior counts
[TABLE]
Appendix C Additional Results
Figures 7, 8, 9 present tabulated metrics on ALLHAT with Hazard Ratio, Risk Difference and Restricted Mean Survival Time respectively. Figures 10, 11, 12 present tabulated metrics BARI2D with Hazard Ratio, Risk Difference and Restricted Mean Survival Time metrics respectively.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alaa and Van Der Schaar (2017) Ahmed M Alaa and Mihaela Van Der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. Advances in neural information processing systems , 30, 2017.
- 2Binder (1992) David A Binder. Fitting cox’s proportional hazards models from survey data. Biometrika , 79(1):139–147, 1992.
- 3Breslow (1972) Norman E Breslow. Contribution to discussion of paper by dr cox. J. Roy. Statist. Soc., Ser. B , 34:216–217, 1972.
- 4Bretthauer et al. (2022) Michael Bretthauer, Magnus Løberg, Paulina Wieszczy, Mette Kalager, Louise Emilsson, Kjetil Garborg, Maciej Rupinski, Evelien Dekker, Manon Spaander, Marek Bugajski, et al. Effect of colonoscopy screening on risks of colorectal cancer and related death. New England Journal of Medicine , 2022.
- 5Buse et al. (2007) John B Buse, ACCORD Study Group, et al. Action to control cardiovascular risk in diabetes (accord) trial: design and methods. The American journal of cardiology , 99(12):S 21–S 33, 2007.
- 6Chapfuwa et al. (2021) Paidamoyo Chapfuwa, Serge Assaad, Shuxi Zeng, Michael J Pencina, Lawrence Carin, and Ricardo Henao. Enabling counterfactual survival analysis with balanced representations. In Proceedings of the Conference on Health, Inference, and Learning , pages 133–145, 2021.
- 7Chen (2009) Yi-Hau Chen. Weighted breslow-type and maximum likelihood estimation in semiparametric transformation models. Biometrika , 96(3):591–600, 2009.
- 8Cox (1972) David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) , 34(2):187–202, 1972.
