Generating Accurate Synthetic Survival Data by Conditioning on Outcomes
Mohd Ashhad, Ricardo Henao

TL;DR
This paper introduces a new method for generating synthetic survival data that accurately captures event times and censoring, enhancing data utility while respecting privacy and fairness constraints.
Contribution
The authors propose a simple, effective approach for generating covariates conditioned on survival outcomes (event times and censoring indicators) without making assumptions about the censoring mechanism, which improves the quality of the synthetic data.
Findings
Consistently outperforms baseline methods on real-world datasets
Improves downstream survival model performance
Accurately reproduces distributions of event and censored times
Abstract
Synthetically generated data can improve privacy, fairness, and data accessibility; however, generating it can be challenging in specialized scenarios such as survival analysis. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data. We propose a conceptually simple approach that generates covariates conditioned on event times and censoring indicators by leveraging existing tabular data generation models without making assumptions about the mechanism underlying censoring. Experiments on real-world datasets demonstrate that our method consistently outperforms baselines and improves downstream survival model performance.
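The idea in the abstract, generating covariates given the outcome pair (event time, censoring indicator), can be illustrated with a minimal sketch. This is not the authors' implementation: the paper relies on existing tabular generative models, whereas the sketch swaps in a simple linear-Gaussian conditional model as a stand-in. The function names (`fit_conditional_gaussian`, `sample_synthetic`), the toy data, and the choice to resample outcome pairs from the observed data are all assumptions made for illustration.

```python
import numpy as np

def fit_conditional_gaussian(X, t, e):
    """Stand-in conditional generator (an assumption, not the paper's model):
    X | (t, e) ~ Normal(B^T z, Sigma) with z = [1, log t, e]."""
    Z = np.column_stack([np.ones_like(t), np.log(t), e])
    B, *_ = np.linalg.lstsq(Z, X, rcond=None)   # least-squares regression of X on z
    resid = X - Z @ B
    Sigma = np.atleast_2d(np.cov(resid, rowvar=False))
    return B, Sigma

def sample_synthetic(X, t, e, n, rng):
    """Draw n synthetic rows (covariates, time, indicator)."""
    B, Sigma = fit_conditional_gaussian(X, t, e)
    # Assumption of this sketch: outcome pairs are resampled jointly from the
    # observed data, so no model of the censoring mechanism is needed.
    idx = rng.integers(0, len(t), size=n)
    t_syn, e_syn = t[idx], e[idx]
    # Covariates are then generated conditioned on each (t, e) pair.
    Z = np.column_stack([np.ones(n), np.log(t_syn), e_syn])
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), Sigma, size=n)
    return Z @ B + noise, t_syn, e_syn

# Toy data, purely illustrative (not from the paper).
rng = np.random.default_rng(0)
n_obs = 500
t = rng.exponential(scale=10.0, size=n_obs) + 0.1   # strictly positive times
e = rng.integers(0, 2, size=n_obs)                  # 1 = event observed, 0 = censored
X = np.column_stack([
    0.5 * np.log(t) + rng.normal(size=n_obs),
    e + rng.normal(size=n_obs),
])

X_syn, t_syn, e_syn = sample_synthetic(X, t, e, n=1000, rng=rng)
```

Because the synthetic times and indicators in this sketch are taken directly from the observed pairs, their joint distribution matches the data by construction; only the covariates come from the conditional model. In practice the linear-Gaussian stand-in would be replaced by a conditional tabular generator.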
Taxonomy
Topics: Statistical Methods and Inference · Machine Learning in Healthcare · Machine Learning and Data Classification
