On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach
James Edward Jackson, Robin Mitra, Brian Joseph Francis, Iain Dove

TL;DR
This paper examines how to optimally choose the number of synthetic datasets for categorical data synthesis, balancing data utility and disclosure risk, and introduces new risk metrics with empirical validation.
Contribution
It introduces a framework for selecting the optimal number of synthetic datasets considering the risk-utility trade-off and proposes new risk assessment metrics for categorical data.
Findings
Increasing the number of synthetic datasets $m$ improves utility but raises disclosure risk.
The paper proposes metrics τ₃(k,d) and τ₄(k,d) for risk assessment.
Empirical demonstrations validate the proposed methods.
Abstract
Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions - including when saturated count models are used to synthesize categorical data - single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, the paper considers two ways of analysing categorical data sets: as they have a contingency table representation, multiple categorical data sets can be averaged before being analysed, as opposed to the usual way of averaging post-analysis. This paper also introduces a pair of metrics, $\tau_3(k,d)$ and $\tau_4(k,d)$, that are suited for assessing disclosure risk in…
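The two analysis routes mentioned in the abstract (averaging the synthetic tables before analysis versus averaging the analysis results afterwards) can be illustrated with a small sketch. It is not the paper's synthesizer: it assumes, purely for illustration, that each cell of a flattened contingency table is redrawn from a Poisson distribution whose mean is the original count. The table `original`, the helper names, and the choice of a cell proportion as the "analysis" are all hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """One Poisson(lam) draw via Knuth's multiplication method (fine for small lam)."""
    if lam <= 0:
        return 0  # a zero mean always yields a zero count
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def synthesize(counts, m, seed=0):
    """Return m synthetic tables; each cell is drawn from Poisson(original count).

    Assumption: this stands in for a saturated count model, not the paper's exact method.
    """
    rng = random.Random(seed)
    return [[poisson_draw(rng, c) for c in counts] for _ in range(m)]


def average_cells(tables):
    """Cell-wise mean across the m synthetic tables (averaging before analysis)."""
    m = len(tables)
    return [sum(col) / m for col in zip(*tables)]


def proportion(table, cell):
    """Toy 'analysis': the share of the table total that falls in one cell."""
    total = sum(table)
    return table[cell] / total if total else 0.0


original = [12, 3, 0, 25]  # hypothetical flattened 2x2 contingency table
tables = synthesize(original, m=5)

# Route 1: average the m tables first, then analyse the averaged table.
pre = proportion(average_cells(tables), cell=0)

# Route 2: analyse each table separately, then average the m estimates.
post = sum(proportion(t, 0) for t in tables) / len(tables)

print(f"average-then-analyse: {pre:.3f}")
print(f"analyse-then-average: {post:.3f}")
```

For a nonlinear statistic such as a proportion the two routes generally give different values, which is why the order of averaging matters. Rerunning with a larger `m` shows the estimates settling closer to the original proportion, the utility side of the trade-off the abstract describes.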
Taxonomy
Topics: Advanced Statistical Methods and Models
