Evaluation of imputation techniques with varying percentage of missing data
Seema Sangari, Herman E. Ray

TL;DR
This paper compares five recent imputation methods for missing data (Amelia, Mice, mi, Hmisc and missForest) by RMSE and finds that missForest performs best and mi performs worst.
Contribution
It provides a formal comparison of five emerging imputation techniques against the well-established mean-based replacement approach, using RMSE against actual values, and highlights the superior performance of missForest.
Findings
missForest achieved the lowest RMSE among methods
mi algorithm performed the worst in accuracy
performance varied with percentage of missing data
Abstract
Missing data is a common problem that has consistently plagued statisticians and applied analytical researchers. While replacement methods such as mean-based or hot deck imputation have been well researched, emerging imputation techniques enabled by improved computational resources have had limited formal assessment. This study formally considers five more recently developed imputation methods (Amelia, Mice, mi, Hmisc and missForest) and compares their performance using RMSE against actual values and against the well-established mean-based replacement approach. The RMSE measure was consolidated by method using a ranking approach. Our results indicate that the missForest algorithm performed best and the mi algorithm performed worst.
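The evaluation loop described in the abstract (remove a known percentage of values, impute them, score the imputed cells by RMSE against the actual values, then rank the methods) can be illustrated with a minimal Python sketch. This is not the authors' code: the five packages studied are R packages and are not reproduced here. The sketch uses only mean-based replacement plus a median-based stand-in as a hypothetical second method so that the ranking step has something to compare. The function names, the synthetic data, the missing-data percentages, and the simple "rank 1 = lowest RMSE" rule are all assumptions made for illustration; the abstract does not specify how the ranking was constructed.

```python
import math
import random
import statistics


def mask_values(column, fraction, rng):
    """Copy `column`, set round(fraction * n) random entries to None,
    and return the masked copy together with the masked positions."""
    n_missing = round(fraction * len(column))
    missing_idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in missing_idx:
        masked[i] = None
    return masked, missing_idx


def mean_impute(masked):
    """Mean-based replacement: fill each gap with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def median_impute(masked):
    """Hypothetical second method (not from the paper), used only to demo ranking."""
    observed = [v for v in masked if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in masked]


def rmse(actual, imputed, missing_idx):
    """Root mean squared error over the imputed cells only."""
    squared = [(actual[i] - imputed[i]) ** 2 for i in missing_idx]
    return math.sqrt(sum(squared) / len(squared))


def rank_methods(rmse_by_method):
    """Assumed ranking rule: rank 1 goes to the lowest RMSE (ties not handled)."""
    ordered = sorted(rmse_by_method, key=rmse_by_method.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


if __name__ == "__main__":
    rng = random.Random(0)
    actual = [rng.gauss(50, 10) for _ in range(200)]  # synthetic data, illustration only
    methods = {"mean": mean_impute, "median": median_impute}

    for fraction in (0.1, 0.2, 0.3):  # assumed percentages of missing data
        masked, idx = mask_values(actual, fraction, rng)
        scores = {name: rmse(actual, fn(masked), idx) for name, fn in methods.items()}
        print(fraction, scores, rank_methods(scores))
```

Scoring only the masked positions keeps the RMSE from being diluted by cells that were never missing. Swapping in a real imputation method would only require another function with the same signature as `mean_impute`.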
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Statistical Methods and Inference · Advanced Statistical Methods and Models
