A simplified approach to generating synthetic data for disclosure control
Gillian Raab, Beata Nowok, Chris Dibben

TL;DR
This paper presents a simplified approach to generating synthetic data for disclosure control, with new variance estimates for large samples of completely synthesised data and practical recommendations for synthesis, illustrated with an example from the Scottish Longitudinal Study.
Contribution
It introduces variance estimates for large samples of completely synthesised data that do not require the synthetic data to be generated from the posterior predictive distribution derived from the observed data, which simplifies the synthesis process.
Findings
New variance estimates for completely synthesised data that do not require generation from the posterior predictive distribution
Recommendations for synthesising data effectively
Illustrative application to Scottish Longitudinal Study data
Abstract
We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available for users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. We introduce new variance estimates for use with large samples of completely synthesised data that do not require them to be generated from the posterior predictive distribution derived from the observed data. We make recommendations on how to synthesise data based on these findings. An example of synthesising data from the Scottish Longitudinal Study is included to illustrate our results.
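The idea of synthesising without a posterior predictive draw can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the function names `synthesise_simple` and `combine` are hypothetical, the model is a single normal variable, the synthetic sets have the same size as the observed data, and the combining rule is one that follows from a standard large-sample argument rather than a formula confirmed by the text above. In that argument, the average within-synthesis variance stands in for the sampling variance of the observed-data estimate, and averaging over m syntheses adds a further 1/m of that amount.

```python
import random
import statistics


def synthesise_simple(observed, m, rng):
    """Generate m synthetic copies from a normal model fitted to `observed`.

    The fitted mean and standard deviation are plugged in directly; no
    parameter values are drawn from a posterior distribution.
    """
    mu_hat = statistics.fmean(observed)
    sd_hat = statistics.stdev(observed)
    n = len(observed)
    return [[rng.gauss(mu_hat, sd_hat) for _ in range(n)] for _ in range(m)]


def combine(synthetic_sets):
    """Combine per-dataset estimates of the mean across m syntheses."""
    m = len(synthetic_sets)
    estimates = [statistics.fmean(s) for s in synthetic_sets]
    variances = [statistics.variance(s) / len(s) for s in synthetic_sets]
    q_bar = statistics.fmean(estimates)
    v_bar = statistics.fmean(variances)
    # Assumed large-sample form when synthetic and observed sizes match:
    # v_bar approximates the sampling variance of the observed-data estimate,
    # and v_bar / m is the extra noise left after averaging m syntheses.
    total_var = v_bar * (1 + 1 / m)
    return q_bar, v_bar, total_var


rng = random.Random(0)
observed = [rng.gauss(50.0, 10.0) for _ in range(500)]
q_bar, v_bar, total_var = combine(synthesise_simple(observed, m=5, rng=rng))
print(q_bar, v_bar, total_var)
```

With m = 5 the assumed total variance is 1.2 times the average within-synthesis variance. Under these assumptions, interval estimates built from the synthetic data alone would be too narrow unless that inflation is applied.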
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Big Data Technologies and Applications · Data Quality and Management
