From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories
Guanglin Zhou, Armin Catic, Motahare Shabestari, Matthew Young, Chaiquan Li, Katrina Poppe, Sebastiano Barbieri

TL;DR
This paper presents a scalable pipeline for generating and auditing synthetic electronic health records (EHRs) that are both statistically faithful to real data and clinically consistent, enabling safer data sharing for healthcare research.
Contribution
The authors developed an integrated method combining high-fidelity generative modeling with automated auditing using large language models to ensure clinical consistency in synthetic EHRs.
Findings
Synthetic records closely match real data in statistical properties.
Automated auditing significantly reduces clinical inconsistencies.
Models trained on audited data perform as well as or better than those trained on real data.
Abstract
Access to electronic health records (EHRs) for digital health research is often limited by privacy regulations and institutional barriers. Synthetic EHRs have been proposed as a way to enable safe and sovereign data sharing; however, existing methods may produce records that capture overall statistical properties of real data but present inconsistencies across clinical processes and observations. We developed an integrated pipeline to make synthetic patient trajectories clinically consistent through two synergistic steps: high-fidelity generation and scalable auditing. Using the MIMIC-IV database, we trained a knowledge-grounded generative model that represents nearly 32,000 distinct clinical events, including demographics, laboratory measurements, medications, procedures, and diagnoses, while enforcing structural integrity. To support clinical consistency at scale, we incorporated an…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Electronic Health Records Systems
