SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Shahriar Noroozizadeh; Xiaobin Shen; Jeremy C. Weiss; George H. Chen

arXiv:2603.05483·cs.LG·March 6, 2026

TL;DR

SurvHTE-Bench is the first comprehensive benchmark designed to evaluate heterogeneous treatment effect estimation methods in survival analysis, addressing challenges like censoring and unobserved counterfactuals across synthetic, semi-synthetic, and real-world datasets.

Contribution

This work introduces SurvHTE-Bench, a modular and diverse benchmark for systematically comparing survival HTE methods under various assumptions and data conditions.

Findings

01. First rigorous comparison of survival HTE methods across multiple datasets.

02. Demonstrates the impact of assumption violations on method performance.

03. Provides a foundation for fair and reproducible evaluation of causal survival analysis.

Abstract

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world…
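To make the synthetic setting described in the abstract concrete, here is a minimal sketch of a right-censored dataset with a known ground-truth effect. It is not the benchmark's actual data-generating process: the function name `simulate_censored_hte`, the exponential event times, the randomized treatment, the independent censoring, and all coefficients are illustrative assumptions. The effect is defined here as the difference in mean survival time, which has a closed form for exponential event times.

```python
import numpy as np


def simulate_censored_hte(n=2000, seed=0):
    """Toy right-censored data with a known CATE (hypothetical setup)."""
    rng = np.random.default_rng(seed)

    # Two baseline covariates and a randomized binary treatment (assumption).
    x = rng.normal(size=(n, 2))
    w = rng.integers(0, 2, size=n)

    # Hypothetical hazard rates under control and under treatment.
    rate0 = np.exp(0.3 * x[:, 0])
    rate1 = np.exp(0.3 * x[:, 0] - 0.5 - 0.4 * x[:, 1])

    # Potential event times; only the one matching w is ever realized.
    t0 = rng.exponential(1.0 / rate0)
    t1 = rng.exponential(1.0 / rate1)
    event_time = np.where(w == 1, t1, t0)

    # Independent censoring (assumption): we observe whichever comes first.
    censor_time = rng.exponential(2.0, size=n)
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(int)

    # Ground truth: an exponential with rate r has mean 1 / r.
    true_cate = 1.0 / rate1 - 1.0 / rate0

    return {"x": x, "w": w, "time": time, "event": event, "true_cate": true_cate}


data = simulate_censored_hte()
print("fraction of uncensored rows:", data["event"].mean())
```

An estimator only sees `x`, `w`, `time`, and `event`; `true_cate` is held back for scoring, which is what makes evaluation possible despite censoring and unobserved counterfactuals.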

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is well written with a clear scope. The benchmark topic is important and interesting for the study of treatment effects on health science datasets. 2. The paper considers a large set of HTE estimators, and the evaluations are done in multiple settings. 3. The paper includes very complete reproducibility resources.

Weaknesses

1. A complete benchmark on HTE with survival datasets is surely needed in the community. There are some benchmark papers on HTE (not in the survival setting), including [Crabbé, J., et al. 2022], [Shimoni, Y., et al. 2018], [Kapkiç, A., et al. 2024] and others. However, extending HTE benchmarks from complete datasets to right-censored datasets may be only a weak improvement. There can be overlaps among those benchmarks. I believe a comprehensive benchmark named SurvHTE-Bench should include method

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is clearly written and easy to follow. The proposed benchmark is thorough and does an excellent job of covering many different scenarios of practical relevance for HTE estimation with survival data. The motivation for the construction of the different synthetic datasets (in particular, tying each one to a combination of the "causal configuration" and survival analysis setup) makes it clear exactly what is being tested by each synthetic dataset. The semi-synthetic and real datasets are

Weaknesses

While the types of CI assumption violations considered are extensive, the ground truth CATEs used for the synthetic data generation (discussed in Appendix A.3) are somewhat limited. In particular, the functional form of the ground truth consists of mostly linear components, with a couple of instances of quadratic terms, square root terms, or threshold discontinuities. Of particular note is that there are no interaction terms between covariates. Examining the effect of more severe nonlinearities,

Reviewer 03 · Rating 6 · Confidence 4

Strengths

To my knowledge, this is the first large-scale benchmark for CATE estimation in the context of right-censored outcomes. The benchmark covers a wide mix of datasets, including semi-synthetic and real ones. It reports a comprehensive set of metrics and provides reproducible code.

Weaknesses

The provided README in the anonymized GitHub doesn't make it clear how to add new learners to the benchmark. For this benchmark to be widely adopted, this should be described (and made as simple as possible). Minor point, but TMLE can't generally be used to estimate the CATE. The parameter isn't smooth enough for TMLE to be used to estimate it. An exception to this occurs when measuring effect modification with respect to a discrete summary of the baseline covariates (e.g., Stitelman et al., 20

Code & Models

Datasets

snoroozi/SurvHTE-Bench
dataset· 717 dl

Taxonomy

Topics: Advanced Causal Inference Techniques · Genetic Associations and Epidemiology · Statistical Methods and Inference