Benchmarking Differentially Private Synthetic Data Generation Algorithms

Yuchao Tao; Ryan McKenna; Michael Hay; Ashwin Machanavajjhala; Gerome Miklau

arXiv:2112.09238 · cs.CR · February 16, 2022 · 24 cites

TL;DR

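This paper systematically benchmarks differentially private synthetic data generation algorithms for tabular data. It evaluates how well the synthetic data preserve the distributions of individual attributes and attribute pairs, pairwise correlations, and ML classification accuracy, and it identifies both the top performers and the algorithms that fail to beat baselines.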

Contribution

It provides a comprehensive empirical comparison of algorithms, highlighting their strengths and weaknesses in generating useful synthetic tabular data.

Findings

01. Top algorithms effectively preserve data distributions and correlations.

02. Some algorithms consistently outperform baselines in utility metrics.

03. Certain methods fail to surpass simple baseline approaches.

Abstract

This work presents a systematic benchmark of differentially private synthetic data generation algorithms for tabular data. The utility of the synthetic data is evaluated by measuring whether it preserves the distributions of individual attributes and of pairs of attributes, whether it preserves pairwise correlations, and by the accuracy of an ML classification model. In a comprehensive empirical evaluation, we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
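
As a rough illustration of the first utility criterion in the abstract (preserving the distributions of individual attributes and of attribute pairs), the sketch below compares one-way and two-way marginals of a real and a synthetic table. The choice of total variation distance as the error measure, the helper names (`marginal`, `tv_distance`, `marginal_errors`), and the toy records are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def marginal(rows, cols):
    """Empirical joint distribution of the attributes in `cols` over `rows`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return {key: count / n for key, count in counts.items()}


def tv_distance(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def marginal_errors(real, synth, columns, k):
    """TV distance between real and synthetic data for every k-way marginal."""
    return {
        cols: tv_distance(marginal(real, cols), marginal(synth, cols))
        for cols in combinations(columns, k)
    }


# Hypothetical toy tables: in `real` the two attributes are perfectly
# associated, while in `synth` they are independent.
real = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "yes"},
    {"sex": "M", "employed": "no"},
    {"sex": "M", "employed": "no"},
]
synth = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "no"},
    {"sex": "M", "employed": "yes"},
    {"sex": "M", "employed": "no"},
]

columns = ["sex", "employed"]
one_way = marginal_errors(real, synth, columns, 1)
two_way = marginal_errors(real, synth, columns, 2)
print(one_way)
print(two_way)
```

In this toy case the synthetic table matches every single-attribute distribution exactly yet misses the joint distribution of the pair, which is one reason a benchmark might look at pairs of attributes and not only at individual ones.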

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Advanced Data Storage Technologies