Evaluating Synthetic Tabular Data Generated To Augment Small Sample Datasets

Javier Marin

arXiv:2211.10760·cs.LG·March 18, 2025·1 cites

TL;DR

This paper critically evaluates methods for assessing synthetic tabular data used to augment small datasets, revealing limitations of traditional metrics and proposing a topological approach with noted instability.

Contribution

It introduces a normalized Bottleneck distance metric for evaluating synthetic data and highlights the need for multi-faceted validation strategies for small sample augmentation.

Findings

1. Global metrics often misrepresent true differences
2. Topological measures show high variability and instability
3. Traditional statistical tests are unreliable with small samples

Abstract

This work proposes a method to evaluate synthetic tabular data generated to augment small sample datasets. While data augmentation techniques can increase sample counts for machine learning applications, traditional validation approaches fail when applied to extremely limited sample sizes. Our experiments across four datasets reveal significant inconsistencies between global metrics and topological measures, with statistical tests producing unreliable significance values due to insufficient sample sizes. We demonstrate that common metrics such as propensity scoring and maximum mean discrepancy (MMD) often suggest similarity where fundamental topological differences exist. Our proposed metric, based on a normalized Bottleneck distance, provides complementary insights but suffers from high variability across experimental runs and occasionally yields values exceeding theoretical bounds, showing inherent instability in topological…
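As a rough illustration of one of the global metrics named in the abstract, the following sketch computes a biased estimate of squared MMD between a small "real" sample and two candidate "synthetic" samples. It is not the paper's implementation: the RBF kernel, the `gamma` value, the sample sizes, and the function names `rbf_kernel` and `mmd2_biased` are all assumptions chosen for the example, and the data are random draws rather than any of the paper's four datasets.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2_biased(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel.

    gamma is an assumed bandwidth parameter, not a value from the paper.
    """
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical small-sample setting: 20 rows, 3 numeric features.
rng = np.random.default_rng(0)
real = rng.normal(size=(20, 3))
same = rng.normal(size=(20, 3))              # drawn from the same distribution
shifted = rng.normal(loc=3.0, size=(20, 3))  # drawn from a shifted distribution

mmd_same = mmd2_biased(real, same)
mmd_shifted = mmd2_biased(real, shifted)
print(f"MMD^2 (same distribution):    {mmd_same:.4f}")
print(f"MMD^2 (shifted distribution): {mmd_shifted:.4f}")
```

With so few rows, the estimate for two draws from the same distribution is not zero, so a single MMD value is hard to interpret without a reference point. This is one way to see why the abstract treats global metrics with caution at small sample sizes.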

Taxonomy

Topics: Advanced Statistical Methods and Models · Statistical Methods and Inference · Bayesian Methods and Mixture Models