Synthetic Tabular Data Validation: A Divergence-Based Approach

Patricia A. Apellániz; Ana Jiménez; Borja Arroyo Galende; Juan Parras; Santiago Zazo

arXiv:2405.07822 · cs.LG · August 1, 2024

TL;DR

This paper introduces a divergence estimation method for validating synthetic tabular data by capturing joint distribution discrepancies, using probabilistic classifiers to improve accuracy over traditional marginal approaches.

Contribution

It proposes a novel divergence-based validation metric employing a probabilistic classifier to estimate joint distribution differences between real and synthetic data.

Findings

1. Accurately estimates divergences for simple distributions.
2. Effectively validates synthetic data on real-world datasets.
3. Outperforms traditional marginal comparison methods.

Abstract

The ever-increasing use of generative models in various fields where tabular data is used highlights the need for robust and standardized validation metrics to assess the similarity between real and synthetic data. Current methods lack a unified framework and rely on diverse and often inconclusive statistical measures. Divergences, which quantify discrepancies between data distributions, offer a promising avenue for validation. However, traditional approaches calculate divergences independently for each feature due to the complexity of joint distribution modeling. This paper addresses this challenge by proposing a novel approach that uses divergence estimation to overcome the limitations of marginal comparisons. Our core contribution lies in applying a divergence estimator to build a validation metric considering the joint distribution of real and synthetic data. We leverage a…
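The idea of scoring synthetic data with a probabilistic classifier over the joint distribution can be illustrated with a small sketch. This is not the authors' implementation: it uses the standard density-ratio trick (with balanced classes, a calibrated classifier's logit approximates log p_real(x) / p_synth(x), so its mean over real rows approximates KL(real || synth)), and the toy Gaussian data, the hand-rolled logistic regression, and the names `kl_estimate`, `marginal_features`, and `joint_features` are all assumptions made for illustration. The real and synthetic samples share identical marginals and differ only in the correlation between the two columns.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.8, epochs=300):
    """Plain batch gradient descent on the logistic loss (illustrative only)."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for feats, t in zip(rows, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, feats))) - t
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def kl_estimate(real, synth, featurize):
    """Estimate KL(real || synth) as the mean classifier logit on real rows.

    Assumes len(real) == len(synth), so the logit approximates
    log p_real(x) / p_synth(x) without a class-prior correction.
    """
    rows = [featurize(x, y) for x, y in real + synth]
    labels = [1.0] * len(real) + [0.0] * len(synth)
    w = fit_logistic(rows, labels)
    logits = [sum(wi * fi for wi, fi in zip(w, f)) for f in rows[:len(real)]]
    return sum(logits) / len(logits)


def marginal_features(x, y):
    # Hypothetical feature map, additive in x and y: it cannot represent
    # dependence between the two columns.
    return [1.0, x, y, x * x, y * y]


def joint_features(x, y):
    # Same as above plus an x*y term, which lets the classifier see correlation.
    return [1.0, x, y, x * x, y * y, x * y]


# Toy data (assumption): both samples have N(0, 1) marginals, but only the
# "real" sample has correlated columns.
random.seed(0)
n, rho = 500, 0.8
real = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
    real.append((x, y))
synth = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(n)]

kl_marginal = kl_estimate(real, synth, marginal_features)
kl_joint = kl_estimate(real, synth, joint_features)
kl_true = -0.5 * math.log(1.0 - rho * rho)  # closed form for this Gaussian pair

print(f"marginal-only features: {kl_marginal:.3f}")
print(f"joint features:         {kl_joint:.3f}")
print(f"analytic KL:            {kl_true:.3f}")
```

Because the marginals match, the classifier restricted to additive features has nothing to separate the two samples with and its estimate should stay near zero, while the classifier that can model the interaction should land near the analytic value of about 0.51. A practical validator would likely use a more flexible classifier and evaluate on held-out rows; both choices are left out here to keep the sketch short.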

Code & Models

Repositories

patricia-a-apellaniz/divergence_estimator (PyTorch, official)