On the consistency of supervised learning with missing values

Julie Josse (XPOP, CMAP); Jacob M. Chen; Nicolas Prost (CMAP, XPOP, PARIETAL); Erwan Scornet (X, CMAP, SU); Gaël Varoquaux (PARIETAL)

arXiv:1902.06931 · stat.ML · March 22, 2024 · 62 cites

TL;DR

This paper studies the consistency of supervised learning with missing values. It shows that imputing with a constant, such as the mean, before learning can be consistent when missing values are not informative, and it advocates decision trees that handle missing values directly.

Contribution

It establishes the consistency of simple imputation before learning and examines, theoretically and empirically, decision trees that incorporate missing values directly in supervised learning.

Findings

01. Mean imputation is consistent when missingness is non-informative.

02. Decision trees with missing values can achieve optimal prediction.

03. The 'missing incorporated in attribute' method effectively handles various missing data scenarios.
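
To make the third finding concrete, here is a minimal sketch of one common way to emulate 'missing incorporated in attribute' (MIA) with an ordinary tree learner. It assumes missing entries are stored as `None`; the helper name `mia_encode` is hypothetical and is not taken from the paper or its repositories. Each feature is duplicated, with missing values set to minus infinity in one copy and plus infinity in the other, so that a standard threshold split can send missing values to either side.

```python
import math


def mia_encode(rows):
    """Duplicate every feature so threshold splits can route missing values.

    Assumption: missing entries are represented as None.
    In the first copy, missing becomes -inf (grouped with small values).
    In the second copy, missing becomes +inf (grouped with large values).
    """
    encoded = []
    for row in rows:
        low = [-math.inf if v is None else v for v in row]
        high = [math.inf if v is None else v for v in row]
        encoded.append(low + high)
    return encoded


# Toy data: two features, three samples, None marks a missing value.
X = [[1.0, None], [None, 3.0], [2.0, 5.0]]
X_mia = mia_encode(X)
```

With this encoding, a split on the first copy of a feature places missing values with the small values, a split on the second copy places them with the large values, and a threshold below every observed value on the first copy isolates the missing values from the observed ones.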

Abstract

In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through…
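
The constant-imputation step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: missing entries are stored as `None`, every column has at least one observed training value, and the helper names `fit_column_means` and `impute` are hypothetical. The means are computed on the training data and the same constants are reused on the test data, since missing values appear in both.

```python
def fit_column_means(rows):
    """Compute the per-column mean over observed (non-None) entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: each column has at least one observed value.
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, fill):
    """Replace each missing entry with the constant for its column."""
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Toy data with missing entries in both the training and the test set.
X_train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
X_test = [[None, 2.0], [5.0, None]]

means = fit_column_means(X_train)      # learned on training data only
X_train_imp = impute(X_train, means)   # completed training set
X_test_imp = impute(X_test, means)     # same constants applied at test time
```

Any supervised learner can then be trained on `X_train_imp` and evaluated on `X_test_imp`; the sketch covers only the imputation step and makes no claim about which learner the consistency result requires.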

Taxonomy

Topics: Face and Expression Recognition · Neural Networks and Applications · Control Systems and Identification