Benchmarking Differentially Private Synthetic Data Generation Algorithms
Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

TL;DR
This paper systematically benchmarks various differentially private synthetic data generation algorithms for tabular data, evaluating their utility in preserving data distribution, correlations, and ML model accuracy to identify top performers.
Contribution
It provides a comprehensive empirical comparison of algorithms, highlighting their strengths and weaknesses in generating useful synthetic tabular data.
Findings
Top algorithms effectively preserve data distributions and correlations.
Some algorithms consistently outperform baselines in utility metrics.
Certain methods fail to surpass simple baseline approaches.
Abstract
This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distributions of individual attributes and of pairs of attributes, the pairwise correlations, and the accuracy of an ML classification model. In a comprehensive empirical evaluation, we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
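To make the marginal-based utility criterion above concrete, here is a minimal sketch that compares the one-way and two-way attribute distributions of a real table and a synthetic table. It assumes total variation distance as the error measure, which is an illustrative choice and not necessarily the metric used in the paper. The helper names (`marginal`, `tv_distance`, `marginal_report`) and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def marginal(rows, cols):
    """Empirical distribution of the given columns over a list of dict rows."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def tv_distance(real, synth, cols):
    """Total variation distance between real and synthetic marginals on cols.

    Assumption: TV distance is used purely for illustration.
    """
    p, q = marginal(real, cols), marginal(synth, cols)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def marginal_report(real, synth, columns):
    """TV distance for every 1-way and 2-way marginal over the given columns."""
    report = {}
    for k in (1, 2):
        for cols in combinations(columns, k):
            report[cols] = tv_distance(real, synth, cols)
    return report

# Toy data: the synthetic table matches each attribute on its own,
# but loses the dependence between age and income.
real = [
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
    {"age": "old", "income": "high"},
]
synth = [
    {"age": "young", "income": "high"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "low"},
    {"age": "old", "income": "high"},
]

report = marginal_report(real, synth, ["age", "income"])
for cols, dist in report.items():
    print(cols, round(dist, 3))
```

In this toy case both one-way marginals have distance 0 while the two-way marginal has distance 0.5, which illustrates why a benchmark would check pairs of attributes in addition to individual ones.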
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Advanced Data Storage Technologies
