Toward Valid Generative Clinical Trial Data with Survival Endpoints
Perrine Chassat, Van Tuan Nguyen, Lucas Ducrot, Emilie Lanoy, Agathe Guilloux

TL;DR
This paper presents a variational autoencoder (VAE) for generating synthetic clinical trial data with survival endpoints. It targets the difficulties posed by censoring and small sample sizes, and reports gains in data utility and privacy.
Contribution
A VAE-based model that jointly generates covariates and survival outcomes without assuming independent censoring, and that outperforms GAN baselines in fidelity and privacy.
Findings
VAE model outperforms GAN baselines in data fidelity and privacy.
Synthetic data improves control-arm augmentation and privacy preservation.
Calibration issues remain, but post-generation procedures help mitigate them.
Abstract
Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late-phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare-disease trials but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry and unstable, and they rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across…
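To make the censoring difficulty concrete, here is a minimal sketch of the textbook right-censored log-likelihood. It is not the paper's model. The Weibull form, the function name `weibull_censored_loglik`, and the toy data are all assumptions chosen for illustration. This standard form also ignores the censoring mechanism, which is only justified when censoring is independent of the event time, the assumption the abstract says the proposed VAE avoids.

```python
import math

def weibull_censored_loglik(times, events, shape, scale):
    """Log-likelihood of right-censored data under a Weibull model.

    Illustrative only: the Weibull choice is an assumption, not the paper's.
    times  : observed times (event time or censoring time), all > 0
    events : 1 if the event was observed, 0 if the subject was censored
    """
    total = 0.0
    for t, d in zip(times, events):
        # log S(t) = -(t / scale)^shape; every subject contributes this term
        log_surv = -((t / scale) ** shape)
        if d:
            # Observed event: add log h(t), so the term is log f(t) = log h(t) + log S(t)
            log_hazard = math.log(shape / scale) + (shape - 1) * math.log(t / scale)
            total += log_hazard + log_surv
        else:
            # Censored: we only know the subject survived past t
            total += log_surv
    return total

# Toy data (hypothetical): one observed event at t=1, one subject censored at t=3
ll = weibull_censored_loglik([1.0, 3.0], [1, 0], shape=1.0, scale=2.0)
```

A censored subject contributes only a survival term, so heavy censoring in a small trial leaves little direct information about late event times, which is one reason these outcomes are hard to generate faithfully.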
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Privacy-Preserving Technologies in Data
