Generating Synthetic Electronic Health Record Data: a Methodological Scoping Review with Benchmarking on Phenotype Data and Open-Source Software
Xingran Chen, Zhenke Wu, Xu Shi, Hyunghoon Cho, Bhramar Mukherjee

TL;DR
This paper reviews and benchmarks methods for generating synthetic electronic health record data, providing open-source tools and guidelines to help practitioners select appropriate approaches based on their specific needs.
Contribution
It offers a comprehensive scoping review, benchmarks major synthetic EHR data generation methods on real datasets, and introduces an open-source Python package with a decision tree for method selection.
Findings
GAN-based methods perform well in data fidelity and utility
Rule-based methods excel in privacy protection
GAN methods outperform baselines in fidelity on MIMIC-IV
Abstract
We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
