On the consistency of supervised learning with missing values
Julie Josse (XPOP, CMAP), Jacob M. Chen, Nicolas Prost (CMAP, XPOP, PARIETAL), Erwan Scornet (X, CMAP, SU), Gaël Varoquaux (PARIETAL)

TL;DR
This paper investigates the consistency of supervised learning methods with missing data, revealing that simple mean imputation can be consistent in certain cases and advocating for decision trees that directly handle missingness.
Contribution
The paper shows, both theoretically and empirically, that simple imputation prior to learning and decision trees that incorporate missing values directly can be consistent in supervised learning.
Findings
Mean imputation is consistent when missingness is non-informative.
Decision trees with missing values can achieve optimal prediction.
The 'missing incorporated in attribute' method effectively handles various missing data scenarios.
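To make the last finding concrete, here is a minimal sketch of how a "missing incorporated in attribute" split can be chosen for a single feature in a regression stump. It follows the usual description of the method (each threshold is tried with missing values sent left and sent right, plus one extra split of missing versus observed) and is not taken from the authors' code. The names `sse` and `best_mia_split`, the use of `None` for a missing entry, and the squared-error criterion are assumptions made for illustration.

```python
def sse(ys):
    """Sum of squared errors around the group mean (0.0 for an empty group)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_mia_split(xs, ys):
    """Pick the best split on one feature, treating None as a missing value.

    Returns (loss, threshold, missing_side), or None if no split is possible.
    A threshold of None encodes the split "missing vs. observed".
    """
    observed = sorted({x for x in xs if x is not None})
    candidates = []

    # Options 1 and 2: ordinary threshold splits, with all missing
    # values routed to the left child or to the right child.
    for lo, hi in zip(observed, observed[1:]):
        t = (lo + hi) / 2
        for side in ("left", "right"):
            left, right = [], []
            for x, y in zip(xs, ys):
                go_left = (side == "left") if x is None else (x <= t)
                (left if go_left else right).append(y)
            candidates.append((sse(left) + sse(right), t, side))

    # Option 3: split on missingness itself.
    miss = [y for x, y in zip(xs, ys) if x is None]
    obs = [y for x, y in zip(xs, ys) if x is not None]
    if miss and obs:
        candidates.append((sse(miss) + sse(obs), None, "left"))

    if not candidates:
        return None
    # Compare on loss only, so None thresholds are never compared to floats.
    return min(candidates, key=lambda c: c[0])
```

On `xs = [1.0, 2.0, None, None, 3.0, 4.0]` with `ys = [0, 0, 5, 5, 0, 0]`, the target depends only on whether the value is missing, and the sketch picks the "missing vs. observed" split. No imputation step is needed, because the missing entries are routed by the split itself.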
Abstract
In many application settings, the data have missing entries, which makes analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through…
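As an illustration of constant imputation prior to learning, the sketch below computes per-column means on the training rows and reuses those same constants on the test rows, matching the setting where missing values appear in both. The helper names `fit_column_means` and `impute`, the `None` encoding of missing entries, and the 0.0 fallback for a fully missing column are assumptions for this example, not part of the paper.

```python
def fit_column_means(rows):
    """Per-column mean over the observed (non-None) training entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Arbitrary fallback if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace each missing entry with the stored constant for its column."""
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
test = [[None, 1.0]]

means = fit_column_means(train)      # constants learned on training data only
train_filled = impute(train, means)  # input to any standard learner
test_filled = impute(test, means)    # same constants reused at prediction time
```

The learner is then fit on `train_filled` and applied to `test_filled`. Fitting the constants on the training rows alone keeps the imputation step part of the learned predictor, rather than something recomputed on the test data.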
Taxonomy
Topics: Face and Expression Recognition · Neural Networks and Applications · Control Systems and Identification
