"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity

William H. Press

arXiv:2406.05264·stat.AP·June 11, 2024

TL;DR

This paper introduces MODP, a method that generates synthetic census data whose crosstabulations closely match those of the original data, and it attempts to quantify how well the privacy of the original data is protected.

Contribution

The paper presents a predictive approach, MODP, for generating synthetic categorical survey data that preserves two-way statistical associations (crosstabulations) to a median accuracy of ~5%.

Findings

01. Synthetic crosstabulations have a median error of ~5% across all cells.

02. Accuracy is reported over cell counts that span four orders of magnitude.

03. The degree of privacy protection is investigated and, where possible, quantified.
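
To make the crosstabulation finding concrete, here is a minimal sketch of how a median per-cell error between original and synthetic two-way tables could be computed. The relative-error definition, the function names, and the toy rows are illustrative assumptions; the paper's exact accuracy metric may differ.

```python
from collections import Counter
from itertools import combinations
from statistics import median

def crosstab_counts(rows):
    """Count every two-way cell, keyed by (question i, answer a, question j, answer b)."""
    counts = Counter()
    for row in rows:
        for i, j in combinations(range(len(row)), 2):
            counts[(i, row[i], j, row[j])] += 1
    return counts

def median_relative_error(original, synthetic):
    """Median of |synthetic - original| / original over cells present in the original.

    Assumption: cells that appear only in the synthetic data are ignored.
    """
    orig = crosstab_counts(original)
    synth = crosstab_counts(synthetic)
    errors = [abs(synth.get(cell, 0) - n) / n for cell, n in orig.items()]
    return median(errors)

# Hypothetical toy data: two questions, ten respondents each.
original = [(0, 0)] * 4 + [(0, 1)] * 2 + [(1, 1)] * 4
synthetic = [(0, 0)] * 3 + [(0, 1)] * 3 + [(1, 1)] * 4

# Per-cell errors are 1/4, 1/2, and 0, so the median is 0.25.
print(median_relative_error(original, synthetic))
```

With real survey data the cell counts vary widely, so a median over cells is less sensitive to a few sparse, noisy cells than a mean would be.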

Abstract

We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.
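
The "minus-one" idea in the abstract can be illustrated with a small sketch: for each question, fit a model that predicts that question's answer from the respondent's other answers, then draw a synthetic answer from the predicted distribution. This is only one reading of the abstract, not the code in the linked repository. Plain softmax regression stands in for the learned L, and the toy data, function names, and hyperparameters are all hypothetical.

```python
import math
import random

def encode_others(row, q, n_cats):
    """One-hot encode every answer except question q, with a leading bias term."""
    x = [1.0]
    for j, answer in enumerate(row):
        if j != q:
            x.extend(1.0 if answer == c else 0.0 for c in range(n_cats[j]))
    return x

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def predict(W, row, q, n_cats):
    """Predicted distribution over question q's answers given the other answers."""
    x = encode_others(row, q, n_cats)
    return softmax([sum(w * v for w, v in zip(Wk, x)) for Wk in W])

def fit_question(data, q, n_cats, epochs=100, lr=0.5):
    """Batch gradient descent on cross-entropy for one question's softmax regression."""
    n_feat = len(encode_others(data[0], q, n_cats))
    W = [[0.0] * n_feat for _ in range(n_cats[q])]
    for _ in range(epochs):
        grad = [[0.0] * n_feat for _ in range(n_cats[q])]
        for row in data:
            x = encode_others(row, q, n_cats)
            p = predict(W, row, q, n_cats)
            for k in range(n_cats[q]):
                err = p[k] - (1.0 if row[q] == k else 0.0)
                for i, v in enumerate(x):
                    grad[k][i] += err * v
        for k in range(n_cats[q]):
            for i in range(n_feat):
                W[k][i] -= lr * grad[k][i] / len(data)
    return W

def synthesize(data, n_cats, rng):
    """One model per question; each synthetic answer is a draw from its prediction."""
    models = [fit_question(data, q, n_cats) for q in range(len(n_cats))]
    synthetic = []
    for row in data:
        synthetic.append([
            rng.choices(range(n_cats[q]), weights=predict(models[q], row, q, n_cats))[0]
            for q in range(len(n_cats))
        ])
    return synthetic

# Hypothetical toy survey: question 1 usually repeats question 0; question 2 is noise.
rng = random.Random(0)
n_cats = [2, 2, 3]
data = []
for _ in range(200):
    a = rng.randrange(2)
    b = a if rng.random() < 0.9 else 1 - a
    data.append([a, b, rng.randrange(3)])

synth = synthesize(data, n_cats, rng)
print(synth[:5])
```

Fitting one independent model per question is what makes the regressions "parallel" in this sketch. It makes no claim about the fidelity or privacy properties reported in the paper, which depend on the actual form of L and on the real PUMS data.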

Code & Models

Repositories

whpress/SyntheticCategoricalData
PyTorch · Official

Taxonomy

Topics: Human Mobility and Location-Based Analysis · Traffic Prediction and Management Techniques