Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests
Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann

TL;DR
This study introduces adversarial random forests (ARF) for synthesizing epidemiological data and demonstrates its effectiveness in reliably reproducing key research findings while maintaining privacy.
Contribution
The paper presents ARF as an efficient, high-quality method for generating synthetic epidemiological data, validated through replication of multiple studies and comparison with existing synthesizers.
Findings
ARF-generated data consistently replicated original study results.
Performance improved as dataset dimensionality and variable complexity decreased.
ARF outperformed other synthesizers in utility, privacy, and efficiency.
Abstract
Synthetic data holds substantial potential to address practical challenges in epidemiology due to restricted data access and privacy concerns. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility and measure privacy risks sufficiently. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research while preserving privacy. We propose adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications covering blood pressure, anthropometry, myocardial infarction, accelerometry,…
