Generating Accurate Synthetic Survival Data by Conditioning on Outcomes

Mohd Ashhad; Ricardo Henao

arXiv:2405.17333 · stat.ML · August 7, 2025

TL;DR

This paper introduces a method for generating synthetic survival data that reproduces the distributions of both observed and censored event times, making the synthetic data more useful for downstream survival modeling in settings where it is shared for privacy, fairness, or accessibility reasons.

Contribution

The authors propose a conceptually simple approach that generates covariates conditioned on survival outcomes (event times and censoring indicators) using existing tabular data generation models, without making assumptions about the censoring mechanism.

Findings

01 Consistently outperforms baseline methods on real-world datasets

02 Improves downstream survival model performance

03 Accurately reproduces the distributions of observed and censored event times

Abstract

Synthetically generated data can improve privacy, fairness, and data accessibility; however, it can be challenging in specialized scenarios such as survival analysis. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data. We propose a conceptually simple approach that generates covariates conditioned on event times and censoring indicators by leveraging existing tabular data generation models without making assumptions about the mechanism underlying censoring. Experiments on real-world datasets demonstrate that our method consistently outperforms baselines and improves downstream survival model performance.
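
The conditioning idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. A per-status linear-Gaussian model stands in for the tabular data generation model the paper relies on, and resampling the outcome pairs from the real data is an assumption made here (the abstract does not say how outcomes are drawn). The function names `fit_conditional_gaussian` and `sample_synthetic` and the toy data are hypothetical.

```python
import numpy as np

def fit_conditional_gaussian(x, t, d):
    """Fit x | (log t, d) as a linear-Gaussian model, one per censoring status.

    Stand-in for a real conditional tabular generator (assumption).
    """
    params = {}
    for status in (0, 1):
        mask = d == status
        design = np.column_stack([np.ones(mask.sum()), np.log(t[mask])])
        coef, *_ = np.linalg.lstsq(design, x[mask], rcond=None)
        resid = x[mask] - design @ coef
        # Small ridge keeps the covariance positive definite.
        cov = np.atleast_2d(np.cov(resid, rowvar=False)) + 1e-6 * np.eye(x.shape[1])
        params[status] = (coef, cov)
    return params

def sample_synthetic(params, t, d, n, rng):
    """Draw (t, d) pairs jointly from the real outcomes, then sample x given them."""
    idx = rng.integers(0, len(t), size=n)
    t_syn, d_syn = t[idx], d[idx]
    x_syn = np.empty((n, params[1][0].shape[1]))
    for status in (0, 1):
        mask = d_syn == status
        coef, cov = params[status]
        design = np.column_stack([np.ones(mask.sum()), np.log(t_syn[mask])])
        noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=mask.sum())
        x_syn[mask] = design @ coef + noise
    return x_syn, t_syn, d_syn

# Toy right-censored data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n_real = 500
x = rng.normal(size=(n_real, 3))
event_time = rng.exponential(scale=np.exp(-0.5 * x[:, 0]))
censor_time = rng.exponential(scale=2.0, size=n_real)
t = np.minimum(event_time, censor_time)       # observed time
d = (event_time <= censor_time).astype(int)   # 1 = event observed, 0 = censored

params = fit_conditional_gaussian(x, t, d)
x_syn, t_syn, d_syn = sample_synthetic(params, t, d, n=1000, rng=rng)
```

Because the synthetic outcome pairs in this sketch are drawn jointly from the observed ones, the observed and censored time distributions are matched by construction, and no model of the censoring mechanism is needed. Only the covariates are generated by the model.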

Code & Models

Repositories

anonymous-785/synthetic_survival_data (Official)

Taxonomy

Topics: Statistical Methods and Inference · Machine Learning in Healthcare · Machine Learning and Data Classification