What's a good imputation to predict with missing values?
Marine Le Morvan (PARIETAL, IJCLab), Julie Josse (CRISAM), Erwan Scornet (CMAP), Gaël Varoquaux (PARIETAL)

TL;DR
This paper shows that, when predicting with missing data, an impute-then-regress procedure with a sufficiently powerful learner is Bayes optimal for almost all imputation functions and under all missing-values mechanisms. It also proposes learning imputation and regression jointly with a neural network, an approach that outperforms traditional two-step procedures in the paper's experiments.
Contribution
It gives impute-then-regress a theoretical foundation by proving that the procedure is Bayes optimal for almost all imputation functions, and it adapts NeuMiss, a neural network architecture, to learn imputation and regression jointly, which improves prediction accuracy.
Findings
Impute-then-regress is Bayes optimal for almost all imputation functions.
Perfect conditional imputation is not necessary for good asymptotic prediction.
NeuMiss outperforms traditional two-step imputation and regression methods in experiments.
Abstract
How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply…
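The impute-then-regress idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and none of it comes from the paper: the one-dimensional data-generating process, the self-masking (MNAR) missingness, and the helper names `simulate`, `impute`, `fit_binned_means`, and `mse`. A simple binned-mean regressor stands in for the "powerful learner"; it is not NeuMiss. The sketch compares two constant imputations against the Bayes predictor, which is known in closed form for this toy setup.

```python
import math
import random
from collections import defaultdict


def simulate(n, rng):
    """Draw (x, y) pairs with y = x + noise; x is None when missing.

    Missingness depends on the value of x itself (self-masking, hence MNAR):
    x is hidden with probability 0.9 if x > 0.5, and 0.1 otherwise.
    """
    data = []
    for _ in range(n):
        x = rng.random()
        y = x + rng.gauss(0.0, 0.1)
        p_missing = 0.9 if x > 0.5 else 0.1
        data.append((None if rng.random() < p_missing else x, y))
    return data


def impute(data, constant):
    """Replace every missing x with the same constant."""
    return [(constant if x is None else x, y) for x, y in data]


def fit_binned_means(data, width=0.01):
    """Flexible piecewise-constant regressor: mean of y within each x-bin."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        b = math.floor(x / width)
        sums[b] += y
        counts[b] += 1
    fallback = sum(y for _, y in data) / len(data)  # used for empty bins

    def predict(x):
        b = math.floor(x / width)
        return sums[b] / counts[b] if b in counts else fallback

    return predict


def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train, test = simulate(5000, rng), simulate(2000, rng)

# Bayes predictor for this toy setup: x when observed, E[X | missing] = 0.7
# when missing (0.7 follows from the masking probabilities chosen above).
bayes_mse = mse(lambda x: x, impute(test, 0.7))

observed = [x for x, _ in train if x is not None]
constants = {"mean": sum(observed) / len(observed), "out-of-range": -1.0}

results = {}
for name, c in constants.items():
    model = fit_binned_means(impute(train, c))
    results[name] = mse(model, impute(test, c))

print(f"Bayes predictor MSE: {bayes_mse:.4f}")
for name, value in results.items():
    print(f"constant imputation ({name}) + binned means MSE: {value:.4f}")
```

With mean imputation, the imputed points pile up inside the support of the observed values, so the best regressor on the imputed data has to jump at that point; the binned learner can only approximate this jump, which costs it a little accuracy in the affected bin. The out-of-range constant avoids the collision. In both cases the learner is free to map the imputed value to whatever prediction suits the missing cases, which is why the choice of constant matters little once the learner is flexible enough.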
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Statistical Methods and Bayesian Inference · Statistical Methods and Inference
