Naive imputation implicitly regularizes high-dimensional linear models

Alexis Ayme (LPSM (UMR_8001)); Claire Boyer (LPSM (UMR_8001)); Aymeric Dieuleveut (CMAP); Erwan Scornet (CMAP)

arXiv:2301.13585 · math.ST · February 1, 2023 · ICML · 1 citation

TL;DR

This paper demonstrates that naive zero imputation in high-dimensional linear models acts as an implicit regularizer similar to ridge regression, which explains its surprisingly good predictive performance despite the bias it introduces.

Contribution

It provides a theoretical analysis linking zero imputation to ridge regularization and recommends averaged SGD on imputed data for effective prediction.
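The link to ridge can be illustrated with a short calculation. This is a sketch under simplifying assumptions that are not taken from the page itself: each coordinate is observed independently with the same probability $\rho$, the mask $P \in \{0,1\}^d$ is independent of $(X, Y)$, and $\tilde{X} = P \odot X$ denotes the zero-imputed input. The symbols $\rho$, $P$, $\tilde{X}$, $\theta$, and $\theta'$ are introduced here for illustration only.

```latex
\begin{align*}
\mathbb{E}\big[(Y - \theta^\top \tilde{X})^2\big]
  &= \mathbb{E}\big[(Y - \rho\,\theta^\top X)^2\big]
     + \rho(1-\rho)\,\theta^\top \operatorname{diag}\!\big(\mathbb{E}[XX^\top]\big)\,\theta \\
  &= \mathbb{E}\big[(Y - \theta'^\top X)^2\big]
     + \frac{1-\rho}{\rho}\,\theta'^\top \operatorname{diag}\!\big(\mathbb{E}[XX^\top]\big)\,\theta',
  \qquad \theta' = \rho\,\theta .
\end{align*}
```

The first line uses $\mathbb{E}[P_j P_k] = \rho^2$ for $j \neq k$ and $\mathbb{E}[P_j^2] = \rho$. If every feature has unit second moment, the extra term reduces to $\lambda \lVert \theta' \rVert_2^2$ with $\lambda = (1-\rho)/\rho$, which is a ridge penalty on the complete-data risk.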

Findings

1. Zero imputation performs implicit ridge regularization.
2. Imputation bias diminishes in high-dimensional settings.
3. Averaged SGD on imputed data yields good generalization bounds.

Abstract

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed…
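The strategy named in the abstract, averaged SGD applied to zero-imputed data, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (`zero_impute`, `averaged_sgd`, `make_data`), the synthetic Gaussian data, the observation rate of 0.7, and the step-size heuristic are all assumptions made for the example.

```python
import numpy as np

def zero_impute(X):
    """Replace missing entries (encoded as NaN) with 0."""
    return np.where(np.isnan(X), 0.0, X)

def averaged_sgd(X, y, step):
    """One pass of constant-step least-squares SGD started at zero.

    Returns the running average of the iterates (Polyak-Ruppert averaging).
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        grad = (X[i] @ theta - y[i]) * X[i]      # gradient of 0.5 * squared error
        theta = theta - step * grad
        theta_bar += (theta - theta_bar) / (i + 1)  # online mean of the iterates
    return theta_bar

def make_data(n, d, rho, rng):
    """Hypothetical linear model with MCAR missingness (observation rate rho)."""
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    beta = np.ones(d)
    y = X @ beta + 0.1 * rng.standard_normal(n)
    observed = rng.random((n, d)) < rho          # True where the entry is observed
    return np.where(observed, X, np.nan), y

rng = np.random.default_rng(0)
X_train, y_train = make_data(2000, 50, 0.7, rng)
X_test, y_test = make_data(2000, 50, 0.7, rng)

X_train = zero_impute(X_train)
X_test = zero_impute(X_test)

# Heuristic step size: 1 / (2 * largest squared row norm). This is an assumption
# of the sketch, not a value taken from the paper.
step = 1.0 / (2.0 * np.max(np.sum(X_train ** 2, axis=1)))
theta_bar = averaged_sgd(X_train, y_train, step)

mse = np.mean((X_test @ theta_bar - y_test) ** 2)
baseline = np.mean(y_test ** 2)                  # risk of always predicting 0
print(f"test MSE: {mse:.3f}, predict-zero baseline: {baseline:.3f}")
```

The test inputs are zero-imputed in the same way as the training inputs, so the predictor is evaluated on data that still contains missing values. No explicit penalty appears in the loop; any regularization comes from the imputation itself and from the single averaged pass.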

Videos

Naive imputation implicitly regularizes high-dimensional linear models · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Statistical Methods and Inference