DiffPuter: Empowering Diffusion Models for Missing Data Imputation
Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu

TL;DR
DiffPuter combines a diffusion model with the EM algorithm for missing data imputation. It learns the joint data distribution, performs accurate conditional sampling, and outperforms existing methods across multiple datasets.
Contribution
This paper presents DiffPuter, a novel diffusion-based imputation method that integrates the EM algorithm to improve missing data estimation, addressing the challenges of incomplete training data and conditional inference.
Findings
DiffPuter outperforms 17 existing imputation methods in MAE and RMSE.
Theoretical analysis links DiffPuter's training to maximum likelihood estimation.
Extensive experiments validate the effectiveness of DiffPuter across diverse datasets.
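As a minimal sketch of the two metrics named above, the snippet below computes MAE and RMSE over the imputed entries only. The function name `masked_mae_rmse` and the choice to score only the originally missing positions are assumptions made for illustration; the summary does not state the paper's exact evaluation protocol.

```python
import numpy as np


def masked_mae_rmse(x_true, x_imputed, missing_mask):
    """Return (MAE, RMSE) computed only over entries flagged as missing.

    Hypothetical helper: restricting the score to imputed entries is an
    assumption, not a protocol confirmed by the summary.
    """
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)

    # Errors on the entries that had to be imputed.
    diff = (x_imputed - x_true)[missing_mask]

    mae = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return mae, rmse
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason imputation results are often reported with both.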
Abstract
Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum…
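The alternation described in the abstract (fit a density model to the completed data, then re-estimate the missing values conditioned on the observed ones) can be illustrated with a much simpler stand-in. The sketch below is not DiffPuter: it swaps the diffusion model for a multivariate Gaussian and replaces conditional sampling with the closed-form Gaussian conditional mean, and it omits the conditional-covariance correction of exact Gaussian EM. The names `em_impute_gaussian`, `n_iters`, and `ridge` are hypothetical, and the code assumes at least two columns, each with at least one observed value.

```python
import numpy as np


def em_impute_gaussian(x, n_iters=20, ridge=1e-6):
    """EM-style imputation with a Gaussian stand-in for the generative model.

    Missing entries are marked with NaN. This is a simplified illustration,
    not the diffusion-based procedure from the paper.
    """
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)

    # Initialise missing entries with the per-column mean of observed values.
    col_means = np.nanmean(x, axis=0)
    filled = np.where(missing, col_means, x)

    for _ in range(n_iters):
        # M-step stand-in: fit the density model (mean, covariance)
        # to the currently completed data.
        mu = filled.mean(axis=0)
        cov = np.cov(filled, rowvar=False) + ridge * np.eye(x.shape[1])

        # E-step stand-in: replace each row's missing entries with their
        # conditional mean given that row's observed entries.
        for i in range(x.shape[0]):
            m = missing[i]
            if not m.any() or m.all():
                # Nothing to impute, or nothing to condition on
                # (fully missing rows keep their initial column-mean fill).
                continue
            o = ~m
            cov_oo = cov[np.ix_(o, o)]
            cov_mo = cov[np.ix_(m, o)]
            filled[i, m] = mu[m] + cov_mo @ np.linalg.solve(cov_oo, x[i, o] - mu[o])

    return filled
```

Observed entries are never modified; only the NaN positions are updated from one iteration to the next. In the method the abstract describes, the Gaussian fit would be replaced by training a diffusion model and the conditional mean by its tailored conditional sampling strategy.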
Peer Reviews
Decision: ICLR 2025 Spotlight
* The paper is well-written, with a clear, thorough, and concise introduction that effectively summarizes key points from previous works.
* The authors specifically address the challenges of the problem and provide a clever solution to mitigate them.
* The paper's main novelty is supported by theoretical proof.
* The evaluations are comprehensive, with thorough and convincing ablation studies.

* A major concern regarding the evaluations: while the paper claims to use a single hyperparameter setting throughout, it is unclear how hyperparameters for the other methods were selected and how sensitive those methods are to them. For me, this concern significantly impacts the overall assessment of the paper.
* While the results are impressive, their importance is not clear. A more convincing evaluation would include the effect on downstream tasks, given that imputation is only a first step in most pipelines.

1. Theoretical analysis: DIFFPUTER’s training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step).
2. Extensive experiments that demonstrate the good performance of the proposed method, compared with existing baselines, across various datasets.

The computational complexity is not explicitly discussed or compared in the numerical experiments; see details below.

- Robust imputation method based on EM.
- Well written and structured.
- The method is theoretically grounded.
- The empirical analysis is extensive.

- Motivation appears to overlook recent work.
- Experimental section lacks fair comparison and clarity.
- Discussion of limitations is lacking.
- Given these weaknesses, the contribution is not strongly justified.

Post-rebuttal update: All the weaknesses were thoroughly addressed in the rebuttal provided by the authors. I appreciate their efforts and the detailed responses, which resolved all my concerns.
Taxonomy
Topics: Statistical Methods and Inference
Methods: ALIGN · Masked autoencoder · Diffusion
