Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests

Jan Kapar; Kathrin Günther; Lori Ann Vallis; Klaus Berger; Nadine Binder; Hermann Brenner; Stefanie Castell; Beate Fischer; Volker Harth; Bernd Holleczek; Timm Intemann; Till Ittermann; André Karch; Thomas Keil; Lilian Krist; Berit Lange; Michael F. Leitzmann; Katharina Nimptsch; Nadia Obi; Iris Pigeot; Tobias Pischon; Tamara Schikowski; Börge Schmidt; Carsten Oliver Schmidt; Anja M. Sedlmair; Justine Tanoey; Harm Wienbergen; Andreas Wienke; Claudia Wigmann; Marvin N. Wright

arXiv:2508.14936·q-bio.QM·May 6, 2026

TL;DR

This study proposes adversarial random forests (ARF) for synthesizing tabular epidemiological data and shows that the resulting synthetic data can reliably reproduce key research findings while preserving privacy.

Contribution

The paper presents ARF as an efficient, high-quality method for generating synthetic epidemiological data, validated by replicating statistical analyses from six epidemiological publications and by comparison with existing synthesizers.

Findings

01 ARF-generated data consistently replicated original study results.

02 Replication quality was higher for datasets with lower dimensionality and less complex variables.

03 ARF outperformed other synthesizers in utility, privacy, and efficiency.

Abstract

Synthetic data holds substantial potential to address practical challenges in epidemiology due to restricted data access and privacy concerns. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility and measure privacy risks sufficiently. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research while preserving privacy. We propose adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications covering blood pressure, anthropometry, myocardial infarction, accelerometry,…
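To make the idea of replicating a statistical analysis on synthetic data concrete, here is a minimal sketch, not the paper's pipeline. It fits the same linear regression to an "original" and a "synthetic" dataset and compares the coefficients through confidence-interval overlap, one common utility measure that the paper may or may not use. Everything in it is an assumption for illustration: the toy variables (`age`, `bmi`, `sbp`) are simulated, the helper names `ols_fit` and `ci_overlap` are hypothetical, and a multivariate Gaussian stands in for the synthesizer in place of ARF.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns coefficients and standard errors."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta, np.sqrt(np.diag(cov))


def ci_overlap(b1, se1, b2, se2, z=1.96):
    """Per-coefficient overlap of two 95% Wald intervals, averaged over both interval lengths.

    Returns values in [0, 1]: 1 means identical intervals, 0 means disjoint intervals.
    """
    lo1, hi1 = b1 - z * se1, b1 + z * se1
    lo2, hi2 = b2 - z * se2, b2 + z * se2
    inter = np.clip(np.minimum(hi1, hi2) - np.maximum(lo1, lo2), 0.0, None)
    return 0.5 * (inter / (hi1 - lo1) + inter / (hi2 - lo2))


# Toy "original" data: simulated for illustration only, not taken from the paper.
rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
bmi = rng.normal(26, 4, n)
sbp = 90 + 0.5 * age + 0.8 * bmi + rng.normal(0, 8, n)
original = np.column_stack([age, bmi, sbp])

# Stand-in synthesizer (assumption): sample from a Gaussian fitted to the joint data.
# In the setting described above, this step would be replaced by ARF.
synthetic = rng.multivariate_normal(
    original.mean(axis=0), np.cov(original, rowvar=False), size=n
)

# Run the same analysis on both datasets and compare the estimates.
b_orig, se_orig = ols_fit(original[:, :2], original[:, 2])
b_syn, se_syn = ols_fit(synthetic[:, :2], synthetic[:, 2])
overlap = ci_overlap(b_orig, se_orig, b_syn, se_syn)

for name, bo, bs, ov in zip(["intercept", "age", "bmi"], b_orig, b_syn, overlap):
    print(f"{name:9s} original={bo:7.3f} synthetic={bs:7.3f} CI overlap={ov:.2f}")
```

An overlap near 1 for a coefficient means that an analyst working only with the synthetic data would report nearly the same interval as one working with the original data; agreement in sign and significance can be checked in the same way.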
