Generating Realistic Synthetic Population Datasets
Hao Wu, Yue Ning, Prithwish Chakraborty, Jilles Vreeken, Nikolaj Tatti, and Naren Ramakrishnan

TL;DR
This paper introduces a maximum entropy-based method for generating realistic synthetic population datasets, enabling better modeling of societal phenomena while respecting privacy constraints.
Contribution
It presents a novel maximum entropy framework and an efficient inference algorithm for creating synthetic datasets from categorical data, validated on real and simulated data.
Findings
Demonstrates effective estimation of data distributions
Shows feasibility through an epidemic simulation application
Outperforms existing methods in accuracy and efficiency
Abstract
Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and activities of synthetic, city-level populations. For instance, in epidemiology, synthetic population datasets are necessary to study disease propagation and intervention measures before implementation. In social science, synthetic population datasets are needed to understand how policy decisions might affect preferences and behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic and procedural characteristics of patient records without violating the confidentiality of individuals. To generate such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle to formalize a generative model such that in a statistically well-founded way we can optimally utilize given prior information about…
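To make the maximum entropy idea concrete, here is a minimal sketch in Python. It is not the authors' inference algorithm: it swaps in classical iterative proportional fitting (IPF), which, started from a uniform table, converges to the maximum entropy joint distribution consistent with the given marginals. The three categorical variables, the two pairwise marginal tables, and the function names `fit_maxent_ipf` and `sample_population` are all assumptions made up for illustration.

```python
import numpy as np


def fit_maxent_ipf(target_ab, target_bc, n_iter=100, tol=1e-10):
    """Fit a joint p[a, b, c] matching two pairwise marginals via IPF.

    Starting from the uniform table, the IPF limit is the maximum entropy
    distribution among all joints with these (A, B) and (B, C) marginals.
    This is a classical stand-in, not the paper's own algorithm.
    """
    n_a, n_b = target_ab.shape
    n_c = target_bc.shape[1]
    p = np.full((n_a, n_b, n_c), 1.0 / (n_a * n_b * n_c))
    for _ in range(n_iter):
        # Rescale so the (A, B) marginal matches its target.
        current_ab = p.sum(axis=2)
        p *= (target_ab / np.where(current_ab > 0, current_ab, 1.0))[:, :, None]
        # Rescale so the (B, C) marginal matches its target.
        current_bc = p.sum(axis=0)
        p *= (target_bc / np.where(current_bc > 0, current_bc, 1.0))[None, :, :]
        # Stop once the (A, B) marginal is still satisfied after the second step.
        if np.abs(p.sum(axis=2) - target_ab).max() < tol:
            break
    return p


def sample_population(p, n, rng):
    """Draw n synthetic individuals; each row holds one category index per variable."""
    flat = rng.choice(p.size, size=n, p=p.ravel() / p.sum())
    return np.column_stack(np.unravel_index(flat, p.shape))


# Hypothetical prior information: two pairwise tables that agree on the
# marginal of the shared variable B (0.3, 0.7).
target_ab = np.array([[0.2, 0.1],
                      [0.1, 0.6]])
target_bc = np.array([[0.1, 0.2],
                      [0.5, 0.2]])

joint = fit_maxent_ipf(target_ab, target_bc)
people = sample_population(joint, 1000, np.random.default_rng(0))
```

Each row of `people` is one synthetic individual described by category indices for the three variables. A full joint table like this only works for a handful of variables, since its size grows exponentially with the number of attributes; the large sets of categorical variables the abstract mentions would need a more compact representation.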
Taxonomy
Topics: demographic modeling and climate adaptation · Insurance, Mortality, Demography, Risk Management · Data Analysis with R
