Simple Imputation Rules for Prediction with Missing Data: Contrasting Theoretical Guarantees with Empirical Performance
Dimitris Bertsimas, Arthur Delarue, Jean Pauphilet

TL;DR
This paper analyzes simple imputation methods for prediction with missing data and shows that a crude method such as mean-impute can be asymptotically optimal for prediction, even though the datasets it produces are less plausible.
Contribution
It establishes asymptotic consistency guarantees for impute-then-regress pipelines across a broad family of imputation methods, then checks these conclusions empirically on synthetic, semi-real, and real datasets to show when simple imputation is effective.
Findings
Mean-impute is asymptotically optimal for prediction.
Mode-impute is asymptotically sub-optimal.
Empirical results mostly support theoretical conclusions.
Abstract
Missing data is a common issue in real-world datasets. This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence. We establish the asymptotic consistency of such pipelines for a broad family of imputation methods. While common sense suggests that a 'good' imputation method produces datasets that are plausible, we show, on the contrary, that, as far as prediction is concerned, crude can be good. Among others, we find that mode-impute is asymptotically sub-optimal, while mean-impute is asymptotically optimal. We then exhaustively assess the validity of these theoretical conclusions on a large corpus of synthetic, semi-real, and real datasets. While the empirical evidence we collect mostly supports our theoretical findings, it also highlights gaps between theory and practice and opportunities for future research, regarding the…
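To make the term concrete, here is a minimal sketch of an impute-then-regress pipeline with mean-impute, using only the Python standard library. It is an illustration, not the authors' code: the function names (`mean_impute`, `fit_line`), the single-feature linear model, the synthetic data, and the 20% missingness rate are all assumptions chosen for brevity.

```python
import random
from statistics import fmean


def mean_impute(values):
    """Replace None entries by the mean of the observed entries."""
    observed_mean = fmean(v for v in values if v is not None)
    imputed = [observed_mean if v is None else v for v in values]
    return imputed, observed_mean


def fit_line(x, y):
    """Closed-form least squares for y ~ intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic data (assumed for illustration): y depends linearly on x.
random.seed(0)
n = 500
x_full = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x_full]

# Hide roughly 20% of the feature values completely at random.
x_missing = [None if random.random() < 0.2 else xi for xi in x_full]

# Step 1: impute. Step 2: regress on the imputed feature.
x_imputed, x_mean = mean_impute(x_missing)
intercept, slope = fit_line(x_imputed, y)

# In-sample error of the pipeline versus predicting the mean of y.
y_hat = [intercept + slope * xi for xi in x_imputed]
y_bar = fmean(y)
mse = fmean((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
baseline_mse = fmean((yi - y_bar) ** 2 for yi in y)
print(f"pipeline MSE: {mse:.3f}, baseline MSE: {baseline_mse:.3f}")
```

Swapping `mean_impute` for another rule (for example, one that fills in the most frequent value) changes only the first step, which is the kind of comparison the abstract describes.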
Taxonomy
Topics: Data Analysis with R · Explainable Artificial Intelligence (XAI) · Hydrological Forecasting Using AI
