Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints
Waldemar Hahn, Jan-Niklas Eckardt, Christoph Röllig, Martin Sedlmayr, Jan Moritz Middeke, Markus Wolfien

TL;DR
This paper evaluates hyperparameter optimization strategies for generative models that produce synthetic clinical trial data, and shows that domain constraints together with preprocessing and postprocessing steps are needed to keep the data valid.
Contribution
It systematically compares HPO objectives across models and emphasizes combining HPO with domain knowledge to enhance synthetic clinical data quality.
Findings
HPO improves synthetic data quality across models.
Compound metrics outperform single-metric optimization.
Preprocessing and postprocessing reduce clinical constraint violations.
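A minimal sketch of how the last two findings could fit together, assuming a compound objective that is a weighted mean of per-metric scores and a postprocessing step that drops rows breaking domain rules. The metric names, weights, column names, and constraints below are hypothetical illustrations and are not taken from the paper.

```python
def compound_score(metrics, weights=None):
    """Weighted mean of metric values, each assumed scaled to [0, 1], higher is better."""
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    total = sum(weights[name] for name in metrics)
    return sum(weights[name] * metrics[name] for name in metrics) / total


def violation_rate(rows, constraints):
    """Fraction of rows that break at least one domain constraint."""
    if not rows:
        return 0.0
    bad = sum(1 for row in rows if not all(check(row) for check in constraints))
    return bad / len(rows)


def drop_violations(rows, constraints):
    """Postprocessing step: keep only rows that satisfy every constraint."""
    return [row for row in rows if all(check(row) for check in constraints)]


# Hypothetical domain constraints for an adult trial cohort (assumption).
constraints = [
    lambda r: r["age"] >= 18,
    lambda r: r["survival_days"] >= r["remission_days"],
]

# Hypothetical synthetic rows; the second and third each break one constraint.
rows = [
    {"age": 54, "remission_days": 30, "survival_days": 400},
    {"age": 12, "remission_days": 20, "survival_days": 300},
    {"age": 61, "remission_days": 90, "survival_days": 45},
]

rate = violation_rate(rows, constraints)
clean = drop_violations(rows, constraints)

# Placeholder fidelity and utility values; validity is derived from the rows above.
metrics = {"fidelity": 0.8, "utility": 0.6, "validity": 1.0 - rate}
score = compound_score(metrics)
```

An HPO loop would maximize `score` instead of a single entry of `metrics`, so a configuration that scores well on fidelity but generates many invalid rows is penalized.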
Abstract
The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) improves generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO objectives across nine generative models, comparing single-metric with compound-metric optimization. Our results demonstrate that HPO consistently improves synthetic data quality, with TabDDPM achieving the largest relative gains, followed by TVAE (60%), CTGAN (39%), and CTAB-GAN+ (38%). Compound-metric optimization outperformed single-metric objectives, producing…
Taxonomy
Methods: Hyper-parameter optimization
