IMAGIC-500: IMputation benchmark on A Generative Imaginary Country (500k samples)
Siyi Sun, David Antony Selby, Yunchuan Huang, Sebastian Vollmer, Seth Flaxman, Anisoara Calinescu

TL;DR
This paper presents IMAGIC-500, a comprehensive benchmark dataset and evaluation framework for missing data imputation methods on socioeconomic data, promoting reproducibility and development of robust algorithms.
Contribution
Introduces IMAGIC-500, a large synthetic socioeconomic dataset and benchmark for evaluating imputation methods under various missing data scenarios.
Findings
Diffusion-based methods show competitive performance.
Imputation accuracy varies across missing mechanisms.
Benchmark facilitates systematic comparison of imputation techniques.
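To make the second finding concrete, here is a minimal sketch of how imputation error can be compared across missingness mechanisms. It is not the paper's pipeline: the toy two-column data, the helper names (mcar_mask, mar_mask, mean_impute, rmse_on_missing), the 30% missing rate, and the use of column-mean imputation are all assumptions chosen for illustration. MCAR drops cells uniformly at random, while the MAR mask here makes missingness in one column depend on the observed value of another.

```python
import numpy as np

def mcar_mask(shape, rate, rng):
    """MCAR: every cell is missing with the same probability (True = missing)."""
    return rng.random(shape) < rate

def mar_mask(X, target_col, driver_col, rate, rng):
    """MAR: missingness in target_col depends only on the observed driver_col.

    Rows with a larger driver value are more likely to lose their target value.
    The average missing probability equals `rate` (assumes rate <= 0.5).
    """
    mask = np.zeros(X.shape, dtype=bool)
    ranks = X[:, driver_col].argsort().argsort() / (len(X) - 1)  # scaled to [0, 1]
    prob = np.clip(2 * rate * ranks, 0.0, 1.0)
    mask[:, target_col] = rng.random(len(X)) < prob
    return mask

def mean_impute(X, mask):
    """Replace masked cells with the mean of the observed cells in that column."""
    X_missing = np.where(mask, np.nan, X)
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(mask, col_means, X)

def rmse_on_missing(X_true, X_imputed, mask):
    """RMSE computed only over the cells that were masked out."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data (hypothetical, not IMAGIC-500): two correlated numeric columns.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=n)
X = np.column_stack([x0, x1])

mask_mcar = mcar_mask(X.shape, rate=0.3, rng=rng)
mask_mar = mar_mask(X, target_col=1, driver_col=0, rate=0.3, rng=rng)

rmse_mcar = rmse_on_missing(X, mean_impute(X, mask_mcar), mask_mcar)
rmse_mar = rmse_on_missing(X, mean_impute(X, mask_mar), mask_mar)

print(f"MCAR RMSE: {rmse_mcar:.3f}")
print(f"MAR  RMSE: {rmse_mar:.3f}")
```

Scoring only the masked cells keeps the metric from being diluted by values that were never missing. Swapping mean_impute for another imputer with the same (X, mask) signature gives a like-for-like comparison under each mechanism.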
Abstract
Missing data imputation in tabular datasets remains a pivotal challenge in data science and machine learning, particularly within socioeconomic research. However, real-world socioeconomic datasets are typically subject to strict data protection protocols, which often prohibit public sharing, even for synthetic derivatives. This severely limits the reproducibility and accessibility of benchmark studies in such settings. Further, there are very few publicly available synthetic datasets. Thus, there is limited availability of benchmarks for systematic evaluation of imputation methods on socioeconomic datasets, whether real or synthetic. In this study, we utilize the World Bank's publicly available synthetic dataset, Synthetic Data for an Imaginary Country, which closely mimics a real World Bank household survey while being fully public, enabling broad access for methodological research.…
Taxonomy
Topics: Demographic Modeling and Climate Adaptation · Human Mobility and Location-Based Analysis · Survey Methodology and Nonresponse
