Predicting missing values: A good idea?

Stef van Buuren

arXiv:2605.03733·stat.ML·May 6, 2026

TL;DR

This paper shows that adding noise to imputed values can eliminate the biases that MSE-optimized imputation introduces, improving the validity of downstream analyses.

Contribution

The paper demonstrates that stochastic imputation, which adds random noise to the predicted values, preserves the variability of the data and reduces bias, challenging the reliance on MSE minimization in imputation methods.

Findings

01. Predictive imputation introduces bias in variance and correlation estimates.

02. Adding noise proportional to MSE can eliminate systematic biases.

03. Popular tools like missForest, softImpute, and mice exhibit biases in predictive mode.

Abstract

Minimizing the Mean Squared Error (MSE) is a key objective in machine learning and is commonly used for imputing missing values. While this approach provides accurate point estimates, it introduces systematic biases in downstream analyses. These biases affect key parameters such as variance, prevalence, correlation, slope, and explained variance. The root cause is that imputed values optimized for MSE are averages, which reduce the natural variability in the data. This paper demonstrates that adding noise to imputed values can effectively eliminate these biases. The required noise level is proportional to the MSE. Using a toy example in a multivariate normal setting, we compare two methods: predictive imputation, which minimizes MSE, and stochastic imputation, which incorporates random noise. Simulation results show that predictive methods systematically introduce bias, while…
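The contrast between the two methods can be illustrated with a small simulation. The following is a minimal sketch and not the paper's own code: it assumes a bivariate normal pair (x, y) with unit variances and correlation rho, values of y missing completely at random, and a linear regression of y on x fitted to the observed cases. The function name `compare_imputations` and all parameter values are hypothetical. The noise standard deviation is set to the residual standard deviation of the fitted model, which is one way to make the noise level match the prediction MSE. The sketch ignores parameter uncertainty, so it is a single stochastic imputation rather than proper multiple imputation.

```python
import numpy as np


def compare_imputations(n=20000, rho=0.6, miss_frac=0.5, seed=0):
    """Compare predictive and stochastic regression imputation (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Complete data: bivariate normal with unit variances and correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

    # Assumption: y is missing completely at random; x is fully observed.
    missing = rng.random(n) < miss_frac
    obs = ~missing

    # Fit y ~ x on the observed cases (np.polyfit returns slope first).
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    pred = intercept + slope * x

    # Residual standard deviation: the square root of the estimated MSE.
    sigma = (y[obs] - pred[obs]).std(ddof=2)

    # Predictive imputation: fill in the conditional mean.
    y_pred = np.where(missing, pred, y)

    # Stochastic imputation: conditional mean plus noise with variance = MSE.
    y_stoch = np.where(missing, pred + rng.normal(0.0, sigma, n), y)

    def summary(v):
        return {"var": v.var(ddof=1), "corr": np.corrcoef(x, v)[0, 1]}

    return {
        "complete": summary(y),
        "predictive": summary(y_pred),
        "stochastic": summary(y_stoch),
    }


if __name__ == "__main__":
    for name, stats in compare_imputations().items():
        print(f"{name:>10}: var={stats['var']:.3f}  corr={stats['corr']:.3f}")
```

Under these assumptions, with rho = 0.6 and half of y missing, the population variance of the predictively imputed y is 0.5 + 0.5 × 0.36 = 0.68 instead of 1, and its correlation with x rises to about 0.6 / sqrt(0.68) ≈ 0.73. The stochastic version should land close to the complete-data values of 1 and 0.6, up to sampling error.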
