Multiple Imputation Through XGBoost
Yongshi Deng, Thomas Lumley

TL;DR
This paper introduces mixgb, a scalable multiple imputation framework using XGBoost that efficiently handles large, complex datasets by capturing non-linear relations and interactions, improving imputation accuracy and computational speed.
Contribution
The paper presents a novel MI framework based on XGBoost, combining subsampling and predictive mean matching for improved scalability and performance on large datasets.
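To make the predictive mean matching step concrete, here is a minimal sketch in plain Python. It is not the mixgb implementation. It assumes predictions for the observed cases (donors) and for the missing cases have already been produced by some fitted model, such as XGBoost. The function name `pmm_impute`, the parameter `k`, and all numbers are hypothetical and chosen only for illustration.

```python
import random


def pmm_impute(donor_pred, donor_obs, target_pred, k=3, rng=None):
    """Predictive mean matching (illustrative sketch, not the mixgb code).

    donor_pred  : model predictions for cases where the value is observed
    donor_obs   : the observed values for those same cases
    target_pred : model predictions for cases where the value is missing
    k           : number of closest donors to sample from
    """
    rng = rng or random.Random()
    imputed = []
    for t in target_pred:
        # Rank donors by how close their predicted value is to the target's.
        nearest = sorted(
            range(len(donor_pred)), key=lambda i: abs(donor_pred[i] - t)
        )[:k]
        # Impute with the *observed* value of one randomly chosen near donor,
        # so every imputed value is a value that actually occurs in the data.
        imputed.append(donor_obs[rng.choice(nearest)])
    return imputed


# Hypothetical predictions and observations, hard-coded for the example.
donor_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
donor_obs = [1.2, 1.9, 3.3, 3.8, 5.1]
target_pred = [2.1, 4.6]

# Repeating the draw with different seeds gives several imputed datasets,
# which is the "multiple" part of multiple imputation.
imputations = [
    pmm_impute(donor_pred, donor_obs, target_pred, k=3, rng=random.Random(seed))
    for seed in range(5)
]
```

Because the imputed values are drawn from observed donors rather than taken directly from the model's predictions, they stay within the range of the real data. In this sketch the variation between imputed datasets comes only from the random donor choice.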
Findings
Effective handling of large datasets with complex structures
High computational efficiency achieved by XGBoost-based imputation
Reduced bias through subsampling and predictive mean matching
Abstract
The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on the proper specification of imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose a scalable MI framework, mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and non-linear relations while achieving high computational efficiency. In addition,…
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Statistical Methods and Inference · Bayesian Methods and Mixture Models
