Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR
Luke Shannon, Song Liu, Katarzyna Reluga

TL;DR
This paper addresses distribution shift in missing data imputation, proposing a risk-based formulation and an importance-weighted correction method that improves imputation performance when data are missing at random (MAR).
Contribution
The paper introduces a novel imputation algorithm that explicitly accounts for the distribution shift that arises when the probability of missingness depends on the data, improving accuracy over otherwise identical uncorrected baselines.
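As a rough illustration of the idea (a standard importance-weighting identity, not necessarily the exact formulation used in the paper), consider the simplified case of a single target variable $x_m$ imputed from always-observed variables $x_o$. The symbols $f$, $R$, $w$, and $\pi$ are assumptions introduced here: $f$ is the imputation model, $R = 1$ indicates that $x_m$ is observed, and $\pi(x_o) = P(R = 1 \mid x_o)$ is the probability of observation, which under MAR depends only on $x_o$.

```latex
\begin{align}
\mathcal{R}(f)
  &= \mathbb{E}_{p(x_o, x_m)}\!\left[ (x_m - f(x_o))^2 \right] \\
  &= \mathbb{E}_{p(x_o, x_m \mid R = 1)}\!\left[ w(x_o)\,(x_m - f(x_o))^2 \right],
\qquad
w(x_o) = \frac{p(x_o, x_m)}{p(x_o, x_m \mid R = 1)} = \frac{P(R = 1)}{\pi(x_o)} .
\end{align}
```

Read this way, minimising unweighted squared error on the observed rows targets the risk under $p(\cdot \mid R = 1)$ rather than the full-data risk, and reweighting by $w$ is one way to close that gap.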
Findings
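Simulation studies show an average RMSE reduction of 3% relative to otherwise identical uncorrected baselines.
Simulation studies show a 7% reduction in Wasserstein distance.
The corrected method consistently outperforms the uncorrected baselines.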
Abstract
Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk minimisation problem. We show that when the probability of missingness depends on the data, many state-of-the-art methods fail to account for the resulting distribution shift between the observed data used for training and the full data distribution used for evaluation. Consequently, these approaches do not minimise mean-squared error on the full data distribution. Instead, we propose a novel imputation algorithm designed to learn an imputation model from the observed data while explicitly accounting for this distribution shift. Simulation studies show consistent improvements over otherwise identical uncorrected baselines, with average reductions of 3%…
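The shift described in the abstract can be seen in a small simulation. The sketch below is not the authors' algorithm; it is a minimal, hypothetical example that assumes a single partially observed variable `x2`, an always-observed variable `x1`, a deliberately misspecified linear imputation model, and a missingness probability that is known exactly (in practice it would have to be estimated). All names and settings are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_linear(x, y, weights=None):
    """Least-squares fit of y ~ a + b*x, optionally with per-row weights."""
    if weights is None:
        weights = np.ones_like(x)
    design = np.column_stack([np.ones_like(x), x])
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef


def full_data_mse(coef, x, y):
    """Mean-squared error of the linear imputer over ALL rows (observed or not)."""
    pred = coef[0] + coef[1] * x
    return float(np.mean((y - pred) ** 2))


rng = np.random.default_rng(0)
n = 20_000

# Full data: x1 is always observed, x2 depends nonlinearly on x1.
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(scale=0.5, size=n)

# MAR mechanism (assumed known here): x2 is more likely observed when x1 is large.
prob_observed = sigmoid(x1)
observed = rng.random(n) < prob_observed

# Uncorrected baseline: fit on observed rows only, ignoring the shift.
coef_plain = fit_linear(x1[observed], x2[observed])

# Importance-weighted fit: weight each observed row by 1 / P(observed | x1).
coef_weighted = fit_linear(
    x1[observed], x2[observed], weights=1.0 / prob_observed[observed]
)

# Evaluation uses the full data, which is only available in a simulation.
mse_plain = full_data_mse(coef_plain, x1, x2)
mse_weighted = full_data_mse(coef_weighted, x1, x2)
print(f"full-data MSE, unweighted fit: {mse_plain:.3f}")
print(f"full-data MSE, weighted fit:   {mse_weighted:.3f}")
```

The linear model is misspecified on purpose: if the imputation model could represent the true conditional mean exactly, both fits would converge to the same function under MAR and the weighting would matter much less. Inverse-probability weights can also become large when the observation probability is close to zero, so practical use typically needs some form of weight estimation and stabilisation.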
