A simplified approach to generating synthetic data for disclosure control

Gillian Raab; Beata Nowok; Chris Dibben

arXiv:1409.0217 · stat.ME · December 12, 2017 · 2 cites

Open Access

TL;DR

This paper presents a simplified approach to generating synthetic data for disclosure control. It introduces new variance estimates for large samples of completely synthesised data, gives practical recommendations for synthesis, and illustrates them with an example from the Scottish Longitudinal Study.

Contribution

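It introduces new variance estimates for large samples of completely synthesised data. These estimates do not require the synthetic data to be generated from the posterior predictive distribution derived from the observed data, which simplifies the synthesis process.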

Findings

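01. New variance estimates for large samples of completely synthesised data that do not require generation from the posterior predictive distribution

02. Recommendations on how to synthesise data, based on these results

03. An illustrative example of synthesising data from the Scottish Longitudinal Study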

Abstract

We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available for users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. We introduce new variance estimates for use with large samples of completely synthesised data that do not require them to be generated from the posterior predictive distribution derived from the observed data. We make recommendations on how to synthesise data based on these findings. An example of synthesising data from the Scottish Longitudinal Study is included to illustrate our results.

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Big Data Technologies and Applications · Data Quality and Management