Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms
Ximing Li, Chendi Wang, Guang Cheng

TL;DR
This paper provides a rigorous statistical analysis of Bayesian network-based differentially private data synthesis methods, establishing accuracy guarantees, utility bounds, and lower bounds for privacy-preserving synthetic data.
Contribution
It introduces the first statistical guarantees for Bayesian network-based DP data synthesis, including upper and lower bounds on accuracy and utility.
Findings
Established total variation and L2 error bounds for DP Bayesian network algorithms.
Derived utility error bounds related to downstream machine learning tasks.
Proved a lower bound on TV accuracy for all epsilon-DP synthetic data generators.
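For reference, the two notions named in these findings have standard textbook definitions, sketched below. The notation is generic and chosen here for illustration (mechanism M, neighboring datasets D and D', distributions P and Q on a finite domain); the paper's own symbols and its choice of neighboring relation may differ.

```latex
% Standard definitions; notation is illustrative, not taken from the paper.

% epsilon-differential privacy: for all neighboring datasets D, D'
% and all measurable output sets S,
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S].

% Total variation distance between distributions P and Q
% on a finite domain \mathcal{X}:
\mathrm{TV}(P, Q) = \sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|
                  = \frac{1}{2} \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.
```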
Abstract
Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV)…
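The noise-injection step described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the Laplace mechanism, add/remove-one neighboring datasets (so a count table has L1 sensitivity 1), and simple clip-and-renormalize post-processing. The function names, the toy data, and the choice of columns are all hypothetical.

```python
import numpy as np

def empirical_marginal(data, cols, domain_sizes):
    """Count table over the selected categorical columns."""
    shape = tuple(domain_sizes[c] for c in cols)
    counts = np.zeros(shape, dtype=float)
    # Unbuffered add so repeated index tuples are all counted.
    np.add.at(counts, tuple(data[:, c] for c in cols), 1.0)
    return counts

def noisy_marginal(counts, epsilon, rng):
    """Laplace-perturbed marginal, projected back to a probability table.

    Assumption: add/remove-one neighbors, so the count table has
    L1 sensitivity 1 and the Laplace scale is 1 / epsilon.
    """
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)  # post-processing: drop negative mass
    total = noisy.sum()
    if total == 0.0:
        # Degenerate case: fall back to the uniform distribution.
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
n = 5000
domain_sizes = [2, 3, 4]  # toy domain: three categorical attributes
data = np.column_stack([rng.integers(0, k, size=n) for k in domain_sizes])

# One two-way marginal over attributes 0 and 2.
counts = empirical_marginal(data, cols=(0, 2), domain_sizes=domain_sizes)
p_true = counts / counts.sum()
p_priv = noisy_marginal(counts, epsilon=1.0, rng=rng)
tv = tv_distance(p_true, p_priv)
print("TV between empirical and noisy marginal:", tv)
```

A marginal-based synthesizer would perturb several such tables, splitting the privacy budget across them, and then fit a graphical model to the noisy marginals. That fitting step is omitted here.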
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Traffic Prediction and Management Techniques · Stochastic Gradient Optimization Techniques
