On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach

James Edward Jackson; Robin Mitra; Brian Joseph Francis; Iain Dove

arXiv:2205.05993·stat.ME·May 13, 2022·PSD

Open Access

TL;DR

This paper examines how to optimally choose the number of synthetic datasets for categorical data synthesis, balancing data utility and disclosure risk, and introduces new risk metrics with empirical validation.

Contribution

It introduces a framework for selecting the optimal number of synthetic datasets considering the risk-utility trade-off and proposes new risk assessment metrics for categorical data.

Findings

01

Increasing the number of synthetic data sets $m$ improves utility but raises disclosure risk.

02

The paper proposes two metrics, $\tau_{3}(k, d)$ and $\tau_{4}(k, d)$, for assessing disclosure risk in categorical data.

03

Empirical demonstrations validate the proposed methods.

Abstract

Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions, including when saturated count models are used to synthesize categorical data, single imputation ($m = 1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, the paper considers two ways of analysing categorical data sets: as they have a contingency table representation, multiple categorical data sets can be averaged before being analysed, as opposed to the usual way of averaging post-analysis. This paper also introduces a pair of metrics, $\tau_{3}(k, d)$ and $\tau_{4}(k, d)$, that are suited for assessing disclosure risk in…
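
The two ways of analysing multiple categorical data sets mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's method: the original counts are made up, each synthetic cell count is drawn from a Poisson distribution whose mean is the original count (one simple stand-in for a saturated count model), and the "analysis" is just the proportion of records in the first cell. The helper names `synthesize`, `average_then_analyse`, `analyse_then_average` and `mean_abs_error` are hypothetical.

```python
import numpy as np


def synthesize(counts, m, rng):
    """Draw m synthetic tables, one Poisson draw per cell (assumed model)."""
    counts = np.asarray(counts, dtype=float)
    return rng.poisson(lam=counts, size=(m,) + counts.shape)


def average_then_analyse(tables):
    """Average the m tables cell by cell, then compute the proportion once."""
    mean_table = tables.mean(axis=0)
    return float(mean_table[0, 0] / mean_table.sum())


def analyse_then_average(tables):
    """Compute the proportion in each table, then average the m estimates."""
    totals = tables.sum(axis=(1, 2))
    return float(np.mean(tables[:, 0, 0] / totals))


def mean_abs_error(tables, counts):
    """Mean absolute gap between the averaged synthetic table and the original."""
    return float(np.abs(tables.mean(axis=0) - np.asarray(counts)).mean())


rng = np.random.default_rng(0)
original = np.array([[50, 30, 20], [10, 40, 25]])  # made-up 2x3 contingency table

for m in (1, 5, 200):
    tables = synthesize(original, m, rng)
    print(
        m,
        average_then_analyse(tables),
        analyse_then_average(tables),
        mean_abs_error(tables, original),
    )
```

Under these assumptions the averaged table tends to move closer to the original counts as $m$ grows, which is one informal way to see why larger $m$ can help utility while also revealing more about the original data. The sketch does not compute the paper's $\tau_{3}(k, d)$ or $\tau_{4}(k, d)$ metrics.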

Taxonomy

Topics: Advanced Statistical Methods and Models