GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources
Angeela Acharya, Siddhartha Sikdar, Sanmay Das, and Huzefa Rangwala

TL;DR
GenSyn is a multi-stage framework that synthesizes individual-level microdata from macro (aggregated) data for a target location together with data from auxiliary locations, aiming to preserve the dependencies between variables when cost and privacy constraints limit access to real microdata.
Contribution
This paper introduces a novel multi-stage framework combining dependency graphs and Gaussian copulas to generate high-quality synthetic microdata from aggregated data sources.
Findings
Outperforms prior methods in dependency structure preservation
Effectively integrates macro and auxiliary data sources
Demonstrates robustness on real-world datasets
Abstract
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to aggregated data (macro data) sources. In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data by combining information from multiple easier-to-obtain lower-resolution data sources. In particular, we introduce a framework that combines univariate and multivariate frequency tables from a given target geographical location with frequency tables from other auxiliary locations to generate synthetic microdata for individuals in the target location. Our method combines the estimation of a dependency graph and conditional probabilities from the target location with the…
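To make the Gaussian copula idea concrete, the following is a minimal sketch, not the authors' implementation. It draws two categorical attributes whose marginals match given univariate frequency tables while a latent correlation couples them. The function names, the example categories and counts, and the correlation value `rho` are all hypothetical; here `rho` is supplied by hand, and the dependency-graph stage described in the abstract is not modeled.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from math import sqrt
from statistics import NormalDist


def make_inverse_cdf(freq_table):
    """Return an inverse CDF for a categorical marginal given as {category: count}."""
    categories = list(freq_table)
    total = sum(freq_table.values())
    cum = list(accumulate(freq_table[c] / total for c in categories))
    cum[-1] = 1.0  # guard against floating-point drift in the last bin

    def inverse_cdf(u):
        # First category whose cumulative probability reaches u.
        return categories[bisect_left(cum, u)]

    return inverse_cdf


def sample_microdata(freq_a, freq_b, rho, n, seed=0):
    """Sample n (a, b) records with a bivariate Gaussian copula.

    freq_a, freq_b: univariate frequency tables (hypothetical inputs).
    rho: latent Gaussian correlation, assumed known for this sketch.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    inv_a = make_inverse_cdf(freq_a)
    inv_b = make_inverse_cdf(freq_b)
    rows = []
    for _ in range(n):
        # Correlated standard normals built from two independent draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then through each marginal's inverse CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        rows.append((inv_a(u1), inv_b(u2)))
    return rows


# Hypothetical frequency tables for illustration only.
age_counts = {"18-34": 30, "35-64": 50, "65+": 20}
income_counts = {"low": 40, "mid": 40, "high": 20}
synthetic = sample_microdata(age_counts, income_counts, rho=0.6, n=5000)
```

Each marginal is reproduced by construction, since every uniform is pushed through that attribute's own inverse CDF. The association between attributes comes only from `rho`, which is why a copula is a natural way to pair marginals from one source with a dependence structure estimated elsewhere.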
Taxonomy
Topics: Demographic Modeling and Climate Adaptation · Human Mobility and Location-Based Analysis · Data-Driven Disease Surveillance
