Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata
Claire Little, Mark Elliot, Richard Allmendinger

TL;DR
This paper introduces a framework to evaluate the utility and disclosure risk of synthetic Census microdata by comparing it to samples of original data, aiding in understanding the trade-offs and improving data sharing practices.
Contribution
The paper presents a novel methodology for measuring the utility and disclosure risk of synthetic data and for comparing both against samples of the original data drawn at varying sample fractions.
Findings
Synthetic data can match the utility and risk levels of original-data samples drawn at specific sample fractions.
Comparison of three synthesis packages reveals differences in data utility and disclosure risk.
The methodology provides a promising approach for evaluating synthetic data in practice.
Abstract
Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible. This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the…
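The comparison described in the abstract can be illustrated with a minimal sketch. It assumes records are dictionaries of categorical values and uses a stand-in utility score (one minus the mean total variation distance between univariate frequency tables); this score is an assumption chosen for illustration, not the metric used in the paper. The function names `freq_table`, `utility`, and `closest_sample_fraction` are hypothetical. The sketch covers utility only; a disclosure risk score could be compared against sample fractions in the same way.

```python
import random
from collections import Counter


def freq_table(records, column):
    """Relative frequency of each category in one column."""
    counts = Counter(r[column] for r in records)
    n = len(records)
    return {k: v / n for k, v in counts.items()}


def utility(original, candidate, columns):
    """Stand-in utility: 1 minus the mean total variation distance
    between the marginal distributions (1.0 means identical marginals)."""
    scores = []
    for col in columns:
        p = freq_table(original, col)
        q = freq_table(candidate, col)
        keys = set(p) | set(q)
        tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
        scores.append(1.0 - tvd)
    return sum(scores) / len(scores)


def closest_sample_fraction(original, synthetic, columns, fractions,
                            repeats=20, seed=0):
    """Find the sample fraction whose average utility is nearest to
    the utility of the synthetic data."""
    rng = random.Random(seed)
    target = utility(original, synthetic, columns)
    per_fraction = {}
    for f in fractions:
        size = max(1, round(f * len(original)))
        # Average over repeated random samples drawn without replacement.
        total = 0.0
        for _ in range(repeats):
            sample = rng.sample(original, size)
            total += utility(original, sample, columns)
        per_fraction[f] = total / repeats
    best = min(fractions, key=lambda f: abs(per_fraction[f] - target))
    return best, target, per_fraction
```

Calling `closest_sample_fraction(original, synthetic, ["age", "sex"], [0.01, 0.05, 0.1])` on lists of record dictionaries returns the nearest fraction, the synthetic data's score, and the per-fraction averages. Averaging over repeated draws is a design choice to reduce the noise of any single random sample.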
Taxonomy
Topics: demographic modeling and climate adaptation · Privacy-Preserving Technologies in Data · Spatial and Panel Data Analysis
