An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Olanrewaju Akande, Fan Li, Jerome Reiter

TL;DR
This study uses simulations to compare three default multiple imputation methods for categorical data and finds that chained equations with classification and regression trees and a fully Bayesian Dirichlet process mixture model outperform chained equations with generalized linear models.
Contribution
The paper provides a systematic comparison of three default multiple imputation methods for categorical data, highlighting how regression trees and Bayesian mixture models perform relative to generalized linear models.
Findings
In the simulations, the regression tree and Bayesian methods outperform chained equations using generalized linear models.
Both regression trees and Bayesian models are reasonable defaults for categorical data imputation.
Simulation results based on American Community Survey data support these conclusions.
Abstract
Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the…
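The basic mechanism the abstract describes (filling in missing values with draws from a predictive model estimated from the observed data, repeated to give several completed datasets) can be illustrated with a toy sketch. This is not any of the three methods compared in the paper: it assumes a single categorical predictor `x`, a single partially missing categorical variable `y`, and a uniform Dirichlet prior on the conditional probabilities of `y` given `x`. The data values and the function names `impute_once` and `multiply_impute` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def impute_once(rows, rng):
    """Return one completed copy of rows; None marks a missing y value.

    Assumption (not from the paper): y given x is multinomial with a
    uniform Dirichlet prior, so each imputation first draws probabilities
    from the posterior and then draws the missing value.
    """
    # Tabulate observed y values within each level of x.
    counts_by_x = defaultdict(Counter)
    for x, y in rows:
        if y is not None:
            counts_by_x[x][y] += 1

    # Every y level seen anywhere in the observed data.
    levels = sorted({y for _, y in rows if y is not None})

    completed = []
    for x, y in rows:
        if y is None:
            counts = counts_by_x[x]
            # Unnormalized Dirichlet(counts + 1) draw via gamma variates;
            # random.choices accepts unnormalized weights.
            weights = [rng.gammavariate(counts[lvl] + 1.0, 1.0) for lvl in levels]
            y = rng.choices(levels, weights=weights, k=1)[0]
        completed.append((x, y))
    return completed


def multiply_impute(rows, m=5, seed=0):
    """Create m completed datasets from one incomplete dataset."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(m)]


# Hypothetical toy data: (x, y) pairs with two missing y values.
data = [
    ("owner", "yes"), ("owner", "yes"), ("owner", "no"), ("owner", None),
    ("renter", "no"), ("renter", "no"), ("renter", "yes"), ("renter", None),
]
completed_sets = multiply_impute(data, m=5)
```

Each element of `completed_sets` is a full copy of the data in which observed values are untouched and only the missing entries vary from one copy to the next. The chained-equations and joint-model methods in the paper extend this idea to many variables with missing values.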
