Evaluation of imputation techniques with varying percentage of missing data

Seema Sangari; Herman E. Ray

arXiv:2109.04227 · stat.ME · December 27, 2022

Open Access

TL;DR

This paper compares five recent imputation methods for missing data (Amelia, Mice, mi, Hmisc and missForest) by RMSE against actual values, finding that missForest is the most accurate and mi the least.

Contribution

It provides a formal comparison of five emerging imputation techniques against the well-established mean-based replacement approach, using RMSE against actual values, and finds that missForest performs best.

Findings

01

missForest achieved the lowest RMSE among methods

02

mi algorithm performed the worst in accuracy

03

performance varied with the percentage of missing data

Abstract

Missing data is a common problem that has consistently plagued statisticians and applied analytical researchers. While replacement methods like mean-based or hot deck imputation have been well researched, emerging imputation techniques enabled through improved computational resources have had limited formal assessment. This study formally considers five more recently developed imputation methods (Amelia, Mice, mi, Hmisc and missForest) and compares their performance using RMSE against actual values and against the well-established mean-based replacement approach. The RMSE measure was consolidated by method using a ranking approach. Our results indicate that the missForest algorithm performed best and the mi algorithm performed worst.
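
The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the five methods in the study are R packages and are not reproduced here, so only the mean-based baseline is shown. The function names, the missing-completely-at-random masking, the synthetic data, and the choice to consolidate by averaging ranks across scenarios are all assumptions made for illustration; the abstract says only that RMSE was consolidated by method using a ranking approach.

```python
import numpy as np


def mask_completely_at_random(data, missing_fraction, rng):
    """Return a copy of data with roughly missing_fraction of entries set to NaN.

    Assumption: values go missing completely at random.
    """
    mask = rng.random(data.shape) < missing_fraction
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask


def mean_impute(masked):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(masked, axis=0)
    return np.where(np.isnan(masked), col_means, masked)


def imputation_rmse(actual, imputed, mask):
    """RMSE between actual and imputed values over the masked entries only."""
    diff = actual[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def average_ranks(rmse_table):
    """Rank methods within each scenario (1 = lowest RMSE), then average per method.

    Assumption: this is one plausible ranking consolidation, not necessarily
    the paper's. Ties are broken by input order.
    """
    methods = list(rmse_table)
    scores = np.array([rmse_table[m] for m in methods])  # (methods, scenarios)
    ranks = scores.argsort(axis=0).argsort(axis=0) + 1
    return dict(zip(methods, ranks.mean(axis=1)))


# Hypothetical demo on synthetic data with varying percentages of missing values.
rng = np.random.default_rng(0)
actual = rng.normal(size=(200, 4))
for fraction in (0.1, 0.3, 0.5):
    masked, mask = mask_completely_at_random(actual, fraction, rng)
    rmse = imputation_rmse(actual, mean_impute(masked), mask)
    print(f"missing={fraction:.0%}  mean-imputation RMSE={rmse:.3f}")
```

To compare several methods, `average_ranks` would take a dictionary mapping each method name to its list of RMSE values, one per missing-data scenario. A lower average rank indicates better overall accuracy.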

Taxonomy

Topics: Statistical Methods and Bayesian Inference · Statistical Methods and Inference · Advanced Statistical Methods and Models