Generating High-quality Privacy-preserving Synthetic Data
David Yavo, Richard Khoury, Christophe Pere, and Sadoune Ait Kaci Azzou

TL;DR
This paper introduces a post-processing framework for synthetic tabular data that can be applied on top of any generative model. It improves distributional fidelity, utility, and privacy by repairing missing or underrepresented categories and enforcing a minimum distance between real and synthetic records.
Contribution
The authors propose a simple, model-agnostic post-processing method that enhances synthetic data quality and privacy without requiring changes to the underlying generative models.
Findings
Reduces divergence between real and synthetic categorical distributions by up to 36%
Improves dependence preservation by 10-14%
Maintains predictive performance within 1% of baseline
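To make the first finding concrete, the sketch below shows one way a divergence between real and synthetic categorical distributions, and its relative reduction, could be computed. This summary does not name the divergence measure the authors use, so the Jensen-Shannon divergence here is an assumption, and `js_divergence` and `relative_reduction` are hypothetical helper names.

```python
import math
from collections import Counter


def category_probs(values, categories):
    """Empirical probability of each category, in the order given."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def js_divergence(real_values, synth_values):
    """Jensen-Shannon divergence (base 2) between two categorical samples.

    Assumption: JS divergence stands in for whatever measure the paper reports.
    The result lies in [0, 1]; 0 means identical category frequencies.
    """
    cats = sorted(set(real_values) | set(synth_values))
    p = category_probs(real_values, cats)
    q = category_probs(synth_values, cats)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def relative_reduction(before, after):
    """Fractional drop in divergence, e.g. 0.36 for a 36% reduction."""
    return (before - after) / before
```

A "36% reduction" in this sense would mean `relative_reduction(d_raw, d_post)` is 0.36, where `d_raw` and `d_post` are the divergences measured before and after post-processing.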
Abstract
Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular…
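The two post-processing steps described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `patch_modes` and `distance_filter`, the `min_ratio` threshold, the use of Euclidean distance to the single nearest real record, and the `resample` callback that supplies replacement records are all hypothetical. The mode patching sketch works on one categorical column at a time and leaves the other columns of each row untouched; the distance filter assumes records are numeric and already scaled.

```python
import math
import random
from collections import Counter


def patch_modes(real_col, synth_col, min_ratio=0.5, seed=0):
    """Return a copy of synth_col with under-represented categories topped up.

    Assumption: a category counts as severely under-represented when its
    synthetic count is below min_ratio times the count implied by the real data.
    """
    rng = random.Random(seed)
    real_counts = Counter(real_col)
    # Target count per category: real share scaled to the synthetic sample size.
    targets = {c: round(n * len(synth_col) / len(real_col))
               for c, n in real_counts.items()}
    patched = list(synth_col)
    counts = Counter(patched)
    for cat, target in targets.items():
        if counts[cat] >= min_ratio * target:
            continue  # category is adequately represented
        for _ in range(target - counts[cat]):
            # Donor rows hold categories that currently exceed their target.
            donors = [i for i, v in enumerate(patched)
                      if counts[v] > targets.get(v, 0)]
            if not donors:
                break
            i = rng.choice(donors)
            counts[patched[i]] -= 1
            patched[i] = cat
            counts[cat] += 1
    return patched


def nearest_real_distance(record, real_records):
    """Euclidean distance from one record to its closest real record."""
    return min(math.dist(record, r) for r in real_records)


def distance_filter(real_records, synth_records, min_dist, resample, max_tries=100):
    """Replace synthetic records that lie closer than min_dist to a real record.

    resample is a hypothetical zero-argument callable returning a fresh
    synthetic record. A record that still fails after max_tries redraws
    is dropped.
    """
    kept = []
    for rec in synth_records:
        for _ in range(max_tries + 1):
            if nearest_real_distance(rec, real_records) >= min_dist:
                kept.append(rec)
                break
            rec = resample()
    return kept
```

Changing only as many rows as needed, and drawing them from over-represented categories, is one simple way to limit the disturbance to dependencies the generator has learned. In practice `resample` would draw from the underlying generator, and a spatial index would replace the brute-force distance search on large datasets.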
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Imbalanced Data Classification Techniques · Data Quality and Management
