Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Zerui Xu, Fang Wu, Yingzhou Lu, Yuanyuan Zhang, Yue Zhao

TL;DR
This paper introduces a Retrieval-Reasoning framework using large language models to generate synthetic clinical trial reports, enhancing data availability for clinical research while preserving privacy.
Contribution
The paper presents a retrieval- and reasoning-based method that uses LLMs to generate realistic synthetic clinical trial data, and shows that this data improves clinical outcome prediction models.
Findings
Synthetic trials effectively augment real datasets
Hybrid fine-tuning improves clinical outcome prediction
Synthetic data preserves privacy while supporting research
Abstract
Machine learning (ML) holds great promise for clinical applications but is often hindered by limited access to high-quality data due to privacy concerns, high costs, and long timelines associated with clinical trials. While large language models (LLMs) have demonstrated strong performance in general-purpose generation tasks, their application to synthesizing realistic clinical trials remains underexplored. In this work, we propose a novel Retrieval-Reasoning framework that leverages few-shot prompting with LLMs to generate synthetic clinical trial reports annotated with binary success/failure outcomes. Our approach integrates a retrieval module to ground the generation on relevant trial data and a reasoning module to ensure domain-consistent justifications. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that the generated synthetic trials…
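The retrieve-then-prompt pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trial` record, the token-overlap (Jaccard) retriever, the prompt wording, and the sample trials are all hypothetical stand-ins chosen for illustration, and no LLM is actually called. The sketch only shows how retrieved real trials might be assembled into a few-shot prompt that asks for a report together with a justification of a target success/failure label.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record: a free-text trial summary plus a binary outcome."""
    text: str
    outcome: str  # "success" or "failure"


def tokens(text: str) -> set[str]:
    # Crude whitespace tokenizer; a real retriever would likely use embeddings.
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    # Token-overlap similarity, used here as a stand-in for a learned retriever.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[Trial], k: int = 2) -> list[Trial]:
    # Rank real trials by similarity to the query and keep the top k.
    q = tokens(query)
    ranked = sorted(corpus, key=lambda t: jaccard(q, tokens(t.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list[Trial], target_outcome: str) -> str:
    # Assemble a few-shot prompt: instruction, retrieved examples, then the request.
    parts = ["Write a synthetic clinical trial report and justify its outcome step by step."]
    for i, ex in enumerate(examples, 1):
        parts.append(f"Example {i}:\n{ex.text}\nOutcome: {ex.outcome}")
    parts.append(
        f"New trial topic: {query}\nTarget outcome: {target_outcome}\nReport and reasoning:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Made-up sample trials, for illustration only.
    corpus = [
        Trial("Phase 2 trial of metformin in type 2 diabetes", "success"),
        Trial("Phase 3 trial of statin therapy for cardiovascular risk", "failure"),
        Trial("Phase 1 oncology trial of a kinase inhibitor", "failure"),
    ]
    query = "metformin diabetes phase 2"
    print(build_prompt(query, retrieve(query, corpus), target_outcome="failure"))
```

The resulting prompt string would then be sent to whichever LLM is in use. Grounding the examples in retrieved real trials is what the abstract attributes to the retrieval module, while the request for step-by-step justification loosely mirrors the role of the reasoning module.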
