Developing synthetic microdata through machine learning for firm-level business surveys
Jorge Cisneros, Timothy Wojan, Matthew Williams, Jennifer Ozawa, Robert Chew, Kimberly Janda, Timothy Navarro, Michael Floyd, Christine Task, Damon Streat

TL;DR
This paper presents a machine learning approach to generate synthetic firm-level microdata that preserves key data characteristics while protecting confidentiality, addressing unique challenges in business survey data.
Contribution
It introduces a novel machine learning model for creating synthetic firm data from surveys, specifically tailored to address confidentiality and industry identification issues.
Findings
Synthetic data closely replicates real survey data.
Econometric analysis confirms data utility for research.
Method enhances confidentiality in business microdata.
Abstract
Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents challenges distinct from those of demographic data, because firms lack anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business…
Taxonomy
Topics: Survey Methodology and Nonresponse · Demographic Modeling and Climate Adaptation · Human Mobility and Location-Based Analysis
