Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data

Gillian M Raab

arXiv:2206.01362·stat.AP·June 28, 2022·PSD

Open Access

TL;DR

This paper presents two differentially private methods for generating synthetic categorical data, evaluates their utility and disclosure risk across various datasets, and discusses the trade-offs involved.

Contribution

Introduces two DP synthetic data generation methods for categorical data, incorporated into the R synthpop package, with evaluation on multiple datasets.

Findings

01

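The first method (DP noise added to a cross tabulation of all the variables) reduced disclosure risk but did not produce synthetic data of adequate quality for any of the ten data sets.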

02

The second method provides usable synthetic data at low epsilon values.

03

Trade-offs between privacy and utility depend on dataset characteristics.

Abstract

This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted…
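
The first method in the abstract (noise added to the full cross tabulation, followed by a multinomial sample) can be illustrated with a minimal Python sketch. This is not the synthpop implementation. The function name `dp_crosstab_synthesis`, the choice of Laplace noise with scale 1/epsilon, and the clipping of negative noisy counts to zero are all assumptions made for illustration; the abstract says only that "DP noise" is added.

```python
import numpy as np


def dp_crosstab_synthesis(codes, levels, epsilon, n_synth, rng):
    """Hypothetical sketch: noisy full cross tabulation, then a multinomial sample.

    codes   : (n, k) integer array, one column per categorical variable
    levels  : tuple with the number of categories of each variable
    epsilon : privacy parameter (smaller means more noise)
    n_synth : number of synthetic records to draw
    """
    # Count every cell of the cross tabulation of all k variables.
    table = np.zeros(levels, dtype=float)
    np.add.at(table, tuple(codes.T), 1.0)

    # ASSUMPTION: Laplace mechanism with sensitivity 1 (one record added or
    # removed changes one cell by 1), so the noise scale is 1 / epsilon.
    noisy = table + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=table.shape)

    # ASSUMPTION: negative noisy counts are set to zero before normalising.
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0:
        noisy = np.ones_like(noisy)  # fall back to a uniform table
    probs = noisy / noisy.sum()

    # Multinomial sample of cell counts from the noisy probabilities,
    # expanded back into one row of category codes per synthetic record.
    counts = rng.multinomial(n_synth, probs.ravel())
    cells = np.repeat(np.arange(probs.size), counts)
    return np.column_stack(np.unravel_index(cells, levels))


rng = np.random.default_rng(0)
original = np.column_stack([rng.integers(0, 2, 200), rng.integers(0, 3, 200)])
synthetic = dp_crosstab_synthesis(original, (2, 3), epsilon=1.0, n_synth=500, rng=rng)
```

The number of cells in the full table is the product of the category counts of all the variables, so with many variables most cells hold small or zero counts and the added noise can dominate them. That is one plausible reading of why the abstract reports inadequate quality for this method.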

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Probability and Risk Models