Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data
Gillian M Raab

TL;DR
This paper presents two differentially private methods for generating synthetic categorical data, evaluates their utility and disclosure risk across various datasets, and discusses the trade-offs involved.
Contribution
Introduces two DP synthetic data generation methods for categorical data, incorporated into the R synthpop package, with evaluation on multiple datasets.
Findings
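The first method, which adds DP noise to a cross-tabulation of all the variables, reduces disclosure risk but does not yield synthetic data of adequate quality for any of the data sets tested.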
The second method provides usable synthetic data at low epsilon values.
Trade-offs between privacy and utility depend on dataset characteristics.
Abstract
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted…
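The first method described in the abstract (noise added to the full cross-tabulation, followed by a multinomial sample) can be illustrated with a minimal Python sketch. This is not the synthpop implementation: the function name `dp_synthesise_crosstab`, its parameters, the choice of Laplace noise with scale 1/epsilon (which assumes neighbouring data sets differ by adding or removing one record), and the clipping of negative cells to zero are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def dp_synthesise_crosstab(data, levels, epsilon, n_synth, rng):
    """Hypothetical sketch: DP synthesis from a noisy full cross-tabulation.

    data    : integer array of shape (n_rows, n_vars); column j is coded 0..levels[j]-1
    levels  : tuple giving the number of categories of each variable
    epsilon : privacy budget (assumed to be spent entirely on the table)
    n_synth : number of synthetic records to draw
    rng     : numpy.random.Generator
    """
    data = np.asarray(data)
    levels = tuple(levels)

    # Cross-tabulate all variables: one cell per combination of categories.
    counts = np.zeros(levels, dtype=float)
    np.add.at(counts, tuple(data.T), 1.0)

    # Assumption: Laplace noise with scale 1/epsilon, i.e. L1 sensitivity 1
    # under add/remove-one-record neighbouring data sets.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

    # Post-processing (does not consume privacy budget): remove negative
    # cells and convert the noisy table to probabilities.
    flat = np.clip(noisy, 0.0, None).ravel()
    if flat.sum() == 0.0:
        flat = np.ones_like(flat)  # fall back to a uniform table
    probs = flat / flat.sum()

    # Multinomial sample of cell counts, then expand cells into records.
    cell_counts = rng.multinomial(n_synth, probs)
    cell_index = np.repeat(np.arange(probs.size), cell_counts)
    return np.column_stack(np.unravel_index(cell_index, levels))
```

Calling the function with, say, three variables of 2, 3 and 4 categories returns an integer array with one row per synthetic record. The number of cells grows multiplicatively with the number of variables, so the noise is spread over many sparse cells, which is consistent with the low utility the abstract reports for this approach.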
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Probability and Risk Models
