Evaluating quality in synthetic data generation for large tabular health datasets
Jean-Baptiste Escudié, Benjamin Barnes, Stefan Meisegeier, Klaus Kraywinkel, Fabian Prasser, Nils Körber

TL;DR
This paper evaluates seven recent synthetic data generation models on large health datasets, proposing a new evaluation methodology and analyzing their fidelity and domain adherence.
Contribution
It introduces a systematic evaluation framework for synthetic health data and provides insights into model performance and domain-specific challenges.
Findings
Proposed a visualization-aligned metric for joint distribution fidelity.
Evaluated models across datasets of varying scales with systematic hyperparameter tuning.
Identified challenges models face in maintaining medical domain fidelity.
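To make the idea of joint distribution fidelity concrete, the sketch below compares two-way marginals of a real and a synthetic table using total variation distance. This is a common stand-in and not the metric proposed in the paper, which this summary does not specify. The function name `pairwise_tvd`, the column names, and the toy records are all hypothetical, and the columns are assumed to be categorical (or already binned) with non-empty inputs.

```python
from collections import Counter
from itertools import combinations


def pairwise_tvd(real_rows, synth_rows, columns):
    """Total variation distance between real and synthetic 2-way marginals.

    Hypothetical helper, not the paper's metric. Rows are dicts keyed by
    column name; values are assumed categorical (or pre-binned), and both
    row lists are assumed non-empty.
    Returns {(col_a, col_b): distance}; 0.0 = identical, 1.0 = disjoint.
    """
    scores = {}
    for a, b in combinations(columns, 2):
        # Count how often each value pair occurs in each table.
        real = Counter((r[a], r[b]) for r in real_rows)
        synth = Counter((r[a], r[b]) for r in synth_rows)
        n_real, n_synth = sum(real.values()), sum(synth.values())
        # Compare relative frequencies over every cell seen in either table.
        cells = set(real) | set(synth)
        scores[(a, b)] = 0.5 * sum(
            abs(real[c] / n_real - synth[c] / n_synth) for c in cells
        )
    return scores


# Toy records, invented for illustration only.
real = [
    {"sex": "F", "site": "breast"},
    {"sex": "F", "site": "lung"},
    {"sex": "M", "site": "lung"},
    {"sex": "M", "site": "prostate"},
]
synthetic = [
    {"sex": "F", "site": "breast"},
    {"sex": "F", "site": "prostate"},  # combination absent from the real data
    {"sex": "M", "site": "lung"},
    {"sex": "M", "site": "prostate"},
]

scores = pairwise_tvd(real, synthetic, ["sex", "site"])
print(scores)
```

A per-pair score like this can be laid out as a matrix over all column pairs, which is one way a single number per cell could be aligned with a single plot. It also shows why pairwise checks touch on domain adherence: a synthetic record with a value combination that never occurs in the real data raises the distance for that pair.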
Abstract
There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries' epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational…
