Evaluating quality in synthetic data generation for large tabular health datasets

Jean-Baptiste Escudié; Benjamin Barnes; Stefan Meisegeier; Klaus Kraywinkel; Fabian Prasser; Nils Körber

arXiv:2604.15961·cs.LG·April 20, 2026

TL;DR

This paper evaluates seven recent synthetic data generation models on large health datasets, proposing a new evaluation methodology and analyzing their fidelity and domain adherence.

Contribution

It introduces a systematic evaluation framework for synthetic health data and provides insights into model performance and domain-specific challenges.

Findings

01

Proposed a visualization-aligned metric for joint distribution fidelity.

02

Evaluated models across datasets of varying scales with systematic hyperparameter tuning.

03

Identified challenges models face in maintaining medical domain fidelity.
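
To make the idea of joint distribution fidelity in finding 01 concrete, here is a minimal sketch of one common generic measure: the total variation distance (TVD) between real and synthetic data over every pair of categorical columns. This is an illustrative stand-in, not the metric proposed in the paper, whose definition is not given in this summary. The function name `pairwise_tvd`, the column names, and the toy rows are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_tvd(real, synth, columns):
    """Total variation distance between real and synthetic data
    for every pair of categorical columns.

    real, synth: lists of dict rows (hypothetical toy format).
    Returns {(col_a, col_b): tvd} with tvd in [0, 1]; 0 means the
    two-way joint distributions are identical.
    """
    scores = {}
    for a, b in combinations(columns, 2):
        # Count how often each (value_a, value_b) cell occurs.
        p = Counter((row[a], row[b]) for row in real)
        q = Counter((row[a], row[b]) for row in synth)
        cells = set(p) | set(q)
        # Counter returns 0 for cells missing on one side.
        scores[(a, b)] = 0.5 * sum(
            abs(p[c] / len(real) - q[c] / len(synth)) for c in cells
        )
    return scores


# Toy rows, invented for illustration only.
real = [
    {"sex": "F", "stage": "I"},
    {"sex": "F", "stage": "II"},
    {"sex": "M", "stage": "I"},
    {"sex": "M", "stage": "II"},
]
synth = [
    {"sex": "F", "stage": "I"},
    {"sex": "F", "stage": "I"},
    {"sex": "M", "stage": "II"},
    {"sex": "M", "stage": "II"},
]

scores = pairwise_tvd(real, synth, ["sex", "stage"])
print(scores)
```

In this toy case the marginal distributions of `sex` and `stage` match exactly between the two tables, yet the pairwise TVD is nonzero. That gap is why fidelity checks on joint distributions can catch errors that per-column comparisons miss.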

Abstract

There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries' epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational…
