Using saturated count models for user-friendly synthesis of categorical data
James Edward Jackson, Robin Mitra, Brian Joseph Francis, Iain Dove

TL;DR
This paper introduces a two-parameter saturated count model approach for synthesizing large categorical administrative datasets efficiently while controlling privacy risks and utility metrics before data generation.
Contribution
It proposes a novel synthesis method using negative binomial and Poisson-inverse Gaussian models tailored for large administrative data, enabling pre-synthesis risk and utility control.
Findings
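The method synthesizes large categorical data sets quickly.
Risk and utility metrics can be satisfied a priori, before any synthetic data are generated.
The flexibility of the two-parameter count models can be used to protect respondents' privacy, especially that of uniques.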
Abstract
Over the past three decades, synthetic data methods for statistical disclosure control have continually evolved, but mainly within the domain of survey data sets. There are certain characteristics of administrative databases, such as their size, which present challenges from a synthesis perspective and require special attention. This paper, through the fitting of saturated count models, presents a synthesis method, tuned by two parameters, that is suitable for administrative databases. The method allows large categorical data sets to be synthesized quickly and allows risk and utility metrics to be satisfied a priori, that is, prior to synthetic data generation. The paper explores how the flexibility afforded by two-parameter count models (the negative binomial and Poisson-inverse Gaussian) can be utilised to protect respondents' - especially uniques' - privacy in synthetic data.…
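The idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that a saturated model's fitted value for each cell equals the observed count, and that the negative binomial is parameterised with mean mu and variance mu + sigma * mu**2. The function names, the example counts, and the choice of "a unique stays a unique" as the risk quantity are hypothetical. Cells with an observed count of zero stay at zero in this sketch, which may differ from how the paper treats them.

```python
import numpy as np

def synthesize_counts(observed, sigma, rng):
    """Draw one synthetic count per cell from a negative binomial whose mean
    equals the observed count (the saturated fit) and whose variance is
    mu + sigma * mu**2 (an assumed parameterisation)."""
    mu = np.asarray(observed, dtype=float)
    lam = np.zeros_like(mu)
    positive = mu > 0
    # Gamma-Poisson mixture: lam ~ Gamma(shape=1/sigma, scale=sigma*mu) has mean mu,
    # and Poisson(lam) is then negative binomial with the stated mean and variance.
    lam[positive] = rng.gamma(shape=1.0 / sigma, scale=sigma * mu[positive])
    return rng.poisson(lam)

def prob_unique_stays_unique(sigma):
    """P(synthetic count = 1 | observed count = 1) under the model above,
    which simplifies to (1 + sigma) ** (-1/sigma - 1)."""
    return float(np.exp(-(1.0 / sigma + 1.0) * np.log1p(sigma)))

rng = np.random.default_rng(0)
observed = np.array([0, 1, 1, 4, 250])  # hypothetical cell counts
synthetic = synthesize_counts(observed, sigma=0.5, rng=rng)
```

Because `prob_unique_stays_unique` depends only on sigma, it can be evaluated before any synthetic data are drawn, which is the sense in which a risk quantity can be fixed a priori. Under these assumptions the probability is 0.25 at sigma = 1, against about 0.37 in the Poisson limit as sigma approaches 0.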
Taxonomy
Topics: Census and Population Estimation · Statistical Methods and Bayesian Inference · Bayesian Modeling and Causal Inference
