Do we Need Dozens of Methods for Real World Missing Value Imputation?
Krystyna Grzesiak, Christophe Muller, Julie Josse, Jeffrey Näf

TL;DR
This paper introduces a systematic benchmarking approach for missing value imputation, emphasizing distributional accuracy and real-world scenarios, and finds iterative methods like mice to be most effective.
Contribution
It presents a novel benchmarking framework that evaluates imputation methods on distributional preservation across synthetic and real-world data, including mixed data types.
Findings
Iterative imputation methods outperform the other approaches considered.
Methods from the mice R package are the most effective.
Methods are evaluated on real-world missingness scenarios, not only on synthetic missingness mechanisms.
Abstract
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. While many studies compare imputation approaches, they often focus on a limited subset of algorithms and evaluate performance primarily through pointwise metrics such as RMSE, which are not suited to measuring how well the true data distribution is preserved. In this work, we provide a systematic benchmarking method based on the idea of treating imputation as a distributional prediction task. We consider a large number of algorithms and, for the first time, evaluate them not only on synthetic missingness mechanisms, but also on real-world missingness scenarios, using the concept of Imputation Scores. Finally, while the focus of previous benchmarks has often been on numerical data, we also consider mixed data sets…
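The abstract's point that pointwise metrics such as RMSE do not measure distributional preservation can be illustrated with a toy sketch. This is not the paper's benchmark and does not implement Imputation Scores. It assumes a single standard normal variable with values removed completely at random, and compares two deliberately simple imputers chosen for illustration: filling with the observed mean, and filling with random draws from the observed values. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def rmse(truth, imputed):
    """Root mean squared error between true and imputed values."""
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth))


def run_demo(n=5000, missing_rate=0.3, seed=0):
    """Toy comparison of mean imputation and random-draw imputation.

    Assumption: one Gaussian variable, values missing completely at random.
    """
    rng = random.Random(seed)
    full = [rng.gauss(0.0, 1.0) for _ in range(n)]
    is_missing = [rng.random() < missing_rate for _ in range(n)]

    observed = [x for x, m in zip(full, is_missing) if not m]
    hidden = [x for x, m in zip(full, is_missing) if m]

    # Imputer 1: replace every missing value with the observed mean.
    obs_mean = statistics.fmean(observed)
    mean_fill = [obs_mean] * len(hidden)

    # Imputer 2: replace every missing value with a random observed value.
    draw_fill = [rng.choice(observed) for _ in hidden]

    return {
        "rmse_mean": rmse(hidden, mean_fill),
        "rmse_draw": rmse(hidden, draw_fill),
        "var_true": statistics.pvariance(full),
        "var_mean": statistics.pvariance(observed + mean_fill),
        "var_draw": statistics.pvariance(observed + draw_fill),
    }


results = run_demo()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this setup, mean imputation obtains the lower RMSE while shrinking the variance of the completed data, whereas random draws score worse on RMSE but keep the variance close to that of the full data. A ranking based on RMSE alone would therefore favor the imputer that distorts the distribution more.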
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Markov Chains and Monte Carlo Methods · Data Analysis with R
