"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity
William H. Press

TL;DR
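This paper introduces MODP, a method that generates synthetic census data by predicting each survey response from the same respondent's other answers. The synthetic data reproduce crosstabulations with high fidelity, and the paper examines how well the privacy of the original data is protected.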
Contribution
The paper presents a novel predictive approach, MODP, for generating synthetic categorical survey data that preserves statistical associations with high accuracy.
Findings
Synthetic crosstabulations have a median error of ~5% across all cells
Accuracy holds for cell counts ranging over four orders of magnitude
Privacy protection of the original data is investigated, with an attempt to quantify it
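
As an illustration of what a per-cell crosstabulation error could look like, here is a minimal Python sketch. It is not the paper's evaluation code. It assumes the error of a cell is the relative difference between synthetic and real counts, which is one plausible reading of "accuracy"; the function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from statistics import median

def crosstab(rows, a, b):
    """Count respondents in each (answer to question a, answer to question b) cell."""
    return Counter((r[a], r[b]) for r in rows)

def median_crosstab_error(real, synth, min_count=1):
    """Median relative count error over all cells of all pairwise crosstabs.

    Assumption: relative error = |synthetic count - real count| / real count,
    computed only for cells with at least `min_count` real respondents.
    """
    n_questions = len(real[0])
    errors = []
    for a in range(n_questions):
        for b in range(a + 1, n_questions):
            real_ct = crosstab(real, a, b)
            synth_ct = crosstab(synth, a, b)
            for cell, n in real_ct.items():
                if n >= min_count:
                    # Counter returns 0 for cells absent from the synthetic data
                    errors.append(abs(synth_ct[cell] - n) / n)
    return median(errors)
```

Taking the median rather than the mean keeps a few sparsely populated cells, where relative errors are naturally large, from dominating the summary.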
Abstract
We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.
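
To make the "minus-one" idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes answers are integer-coded categories, uses a small softmax (multinomial logistic) regression per question, trained by plain gradient descent, as a stand-in for the learned L, and assumes each synthetic answer is drawn conditioned on the real respondent's other answers. The function names and the toy data are hypothetical.

```python
import math
import random

def one_hot_others(row, j, n_cats):
    """Encode every answer except question j as one-hot features, plus a bias term."""
    x = [1.0]
    for k, answer in enumerate(row):
        if k != j:
            x.extend(1.0 if answer == c else 0.0 for c in range(n_cats[k]))
    return x

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fit_question(data, j, n_cats, epochs=100, lr=0.5):
    """Fit a softmax regression predicting question j from all other answers."""
    X = [one_hot_others(r, j, n_cats) for r in data]
    n_feat = len(X[0])
    W = [[0.0] * n_feat for _ in range(n_cats[j])]
    for _ in range(epochs):
        grad = [[0.0] * n_feat for _ in range(n_cats[j])]
        for x, r in zip(X, data):
            p = softmax([sum(w * v for w, v in zip(Wc, x)) for Wc in W])
            for c in range(n_cats[j]):
                err = p[c] - (1.0 if r[j] == c else 0.0)
                for i, v in enumerate(x):
                    grad[c][i] += err * v
        for c in range(n_cats[j]):
            for i in range(n_feat):
                W[c][i] -= lr * grad[c][i] / len(data)
    return W

def predict(W, row, j, n_cats):
    """Probability of each possible answer to question j, given the other answers."""
    x = one_hot_others(row, j, n_cats)
    return softmax([sum(w * v for w, v in zip(Wc, x)) for Wc in W])

def synthesize(data, n_cats, seed=0):
    """Draw one synthetic respondent per real respondent (assumed sampling scheme)."""
    rng = random.Random(seed)
    models = [fit_question(data, j, n_cats) for j in range(len(n_cats))]
    return [
        [rng.choices(range(n_cats[j]), weights=predict(models[j], row, j, n_cats))[0]
         for j in range(len(n_cats))]
        for row in data
    ]

# Toy data: question 1 agrees with question 0 about 90% of the time;
# question 2 is independent noise.
toy_rng = random.Random(1)
n_cats = [2, 2, 3]
data = []
for _ in range(150):
    q0 = toy_rng.randrange(2)
    q1 = q0 if toy_rng.random() < 0.9 else 1 - q0
    data.append([q0, q1, toy_rng.randrange(3)])

synthetic = synthesize(data, n_cats)
```

Each question gets its own independent model, which is what makes the regressions "parallel": no model for one question depends on the fit for another.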
Taxonomy
Topics: Human Mobility and Location-Based Analysis · Traffic Prediction and Management Techniques
