DiffPuter: Empowering Diffusion Models for Missing Data Imputation

Hengrui Zhang; Liancheng Fang; Qitian Wu; Philip S. Yu

arXiv:2405.20690·cs.LG·May 27, 2025

TL;DR

DiffPuter combines a diffusion model with the EM algorithm for missing data imputation. It learns the joint data distribution, performs accurate conditional sampling, and outperforms existing methods across multiple datasets.

Contribution

This paper presents DiffPuter, a novel diffusion-based imputation method that integrates the EM algorithm to improve missing data estimation, addressing the challenges of incomplete training data and conditional inference.

Findings

01

DiffPuter outperforms 17 existing imputation methods in MAE and RMSE.

02

Theoretical analysis links DiffPuter's training to maximum likelihood estimation.

03

Extensive experiments validate the effectiveness of DiffPuter across diverse datasets.
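
The MAE and RMSE named in the first finding can be illustrated with a minimal sketch. It assumes errors are measured only on the entries that were held out as missing; the function name `masked_mae_rmse` and the toy arrays are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def masked_mae_rmse(x_true, x_imputed, missing_mask):
    """Return (MAE, RMSE) computed only over entries flagged as missing.

    Assumption: missing_mask is True where a value was hidden from the
    imputer and False where it was observed.
    """
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)

    # Errors on the held-out entries only; observed entries are ignored.
    err = x_imputed[missing_mask] - x_true[missing_mask]
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return mae, rmse

# Toy example: two entries were hidden and then imputed.
x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imputed = np.array([[1.0, 2.5], [2.0, 4.0]])
missing_mask = np.array([[False, True], [True, False]])
mae, rmse = masked_mae_rmse(x_true, x_imputed, missing_mask)
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason imputation results are often reported with both.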

Abstract

Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum…
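
The alternation described in the abstract (fit a density model on the currently completed data, then update the missing values conditioned on the observed ones) can be sketched as follows. This is not DiffPuter: a multivariate Gaussian is swapped in for the diffusion model so that both steps have closed forms, and the sketch uses the simplified variant that re-fits on mean-imputed data rather than carrying conditional covariances. The name `em_impute_gaussian` and its parameters are hypothetical.

```python
import numpy as np

def em_impute_gaussian(x, missing_mask, n_iters=20):
    """EM-style imputation with a Gaussian stand-in for the density model.

    Assumptions: missing_mask is True at missing entries, and every column
    has at least one observed value.
    """
    x = np.array(x, dtype=float)
    m = np.asarray(missing_mask, dtype=bool)

    # Initialize missing entries with the observed column means.
    col_means = np.nanmean(np.where(m, np.nan, x), axis=0)
    x_hat = np.where(m, col_means, x)

    for _ in range(n_iters):
        # M-step analogue: fit the density model to the completed data.
        mu = x_hat.mean(axis=0)
        cov = np.cov(x_hat, rowvar=False) + 1e-6 * np.eye(x.shape[1])

        # E-step analogue: set each missing block to its conditional mean
        # given the observed entries of the same row.
        for i in range(x.shape[0]):
            mis = m[i]
            if not mis.any():
                continue
            if mis.all():
                x_hat[i] = mu  # nothing observed: fall back to the mean
                continue
            obs = ~mis
            cov_oo = cov[np.ix_(obs, obs)]
            cov_mo = cov[np.ix_(mis, obs)]
            resid = x_hat[i, obs] - mu[obs]
            x_hat[i, mis] = mu[mis] + cov_mo @ np.linalg.solve(cov_oo, resid)
    return x_hat
```

In the paper's setting, the Gaussian fit would be replaced by training the diffusion model and the conditional mean by its tailored conditional sampling; the loop structure is the only part this sketch is meant to convey.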

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The paper is well-written with a clear, thorough, and concise introduction that effectively summarizes key points from previous works
* The authors specifically address the challenges of the problem and provide a clever solution to mitigate them
* The paper's main novelty is supported by theoretical proof
* The evaluations are comprehensive, with thorough and convincing ablation studies

Weaknesses

* A major concern regarding evaluations: while the paper claims to use a single hyperparameter setting throughout, it is unclear how hyperparameters for the other methods were selected and how sensitive those methods are to them. For me, this concern significantly impacts the overall assessment of the paper.
* While the results are impressive, their importance is not clear. A more convincing evaluation would include the effect on downstream tasks, given that imputation is only a first step in most pipelines.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. Theoretical analysis: DIFFPUTER’s training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step).
2. Extensive experiments that demonstrate the good performance, as compared with existing baselines, of the proposed method across various datasets.

Weaknesses

The computational complexity is not explicitly discussed or compared in the numerical experiments; see details below.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Robust imputation method based on EM.
- Well written and structured.
- The method is theoretically grounded.
- The empirical analysis is extensive.

Weaknesses

- Motivation appears to overlook recent work.
- Experimental section lacks fair comparison and clarity.
- Discussion of limitations is lacking.
- Given these weaknesses, the contribution is not strongly justified.

------- Post rebuttal update -------

All the weaknesses were thoroughly addressed in the rebuttal provided by the authors. I appreciate their efforts and the detailed responses, which resolved all my concerns.

Code & Models

Repositories

hengruizhang98/DiffPuter
PyTorch · Official

Taxonomy

Topics: Statistical Methods and Inference

Methods: ALIGN · Masked autoencoder · Diffusion