Likelihood Matching for Diffusion Models

Lei Qian; Wu Su; Yanqi Huang; Song Xi Chen

arXiv:2508.03636·stat.ML·January 23, 2026

3 Reviews

TL;DR

This paper introduces a Likelihood Matching method for training diffusion models by approximating transition densities with Gaussian distributions, ensuring consistent likelihood estimation and providing convergence guarantees.

Contribution

It presents a novel likelihood matching framework for diffusion models, including a quasi-likelihood approximation, a stochastic sampler, and theoretical convergence analysis.

Findings

01 Effective likelihood matching improves diffusion model training

02 The proposed sampler converges with quantifiable error bounds

03 Empirical results validate theoretical guarantees

Abstract

We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered to approximate each reverse transition density by a Gaussian distribution with matched conditional mean and covariance. The score and Hessian functions for the diffusion generation are estimated by maximizing the quasi-likelihood, ensuring a consistent matching of the first two transitional moments between every two time points. A stochastic sampler is introduced to facilitate computation that leverages both the estimated score and Hessian information. We establish consistency of the quasi-maximum likelihood estimation, and provide non-asymptotic convergence…
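As a rough illustration of the moment matching the abstract describes, the sketch below works through a one-dimensional toy case. It is not the authors' objective or code. It assumes a forward step of the form x_t = sqrt(a) x_0 + sqrt(1 - a) eps with standard normal eps, and uses the standard first- and second-order Tweedie identities to turn a score and a Hessian of log p_t into the conditional mean and variance of x_0 given x_t. The function names, the parameter `a`, and the Gaussian toy prior with variance `s2` are all hypothetical choices made for this example.

```python
import math

def matched_moments(x_t, a, score, hess):
    """Conditional mean and variance of x_0 given x_t in one dimension.

    Assumes x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, eps ~ N(0, 1).
    `score` and `hess` are the first and second derivatives of log p_t at x_t.
    """
    sigma2 = 1.0 - a
    mean = (x_t + sigma2 * score) / math.sqrt(a)   # first-order Tweedie
    var = (sigma2 / a) * (1.0 + sigma2 * hess)     # second-order Tweedie
    return mean, var

def gaussian_nll(x, mean, var):
    """Negative log-density of N(mean, var) at x: one quasi-likelihood term."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gaussian_marginal_score_hess(x_t, a, s2):
    """Exact score and Hessian of log p_t when x_0 ~ N(0, s2) (toy prior)."""
    v = a * s2 + (1.0 - a)   # variance of the marginal of x_t
    return -x_t / v, -1.0 / v

# Toy check: with a Gaussian prior the Gaussian approximation is exact.
a, s2, x_t = 0.64, 4.0, 1.5
score, hess = gaussian_marginal_score_hess(x_t, a, s2)
mean, var = matched_moments(x_t, a, score, hess)
print(mean, var, gaussian_nll(0.0, mean, var))
```

In this toy setting the matched mean and variance coincide with the closed-form Gaussian posterior, which is why a Gaussian quasi-likelihood loses nothing here. For non-Gaussian data the same two moments define only an approximation to the reverse transition.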

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

* Conceptually elegant: connects data likelihood with path likelihood of reverse diffusion: an underexplored yet fundamental viewpoint.
* Introduces a principled QMLE formulation, integrating first- and second-order information beyond prior Hessian-regularized SM methods.
* Solid theoretical results with practical, scalable implementation (low-rank Hessian, SMW updates).
* Consistent FID/NLL improvements and faster convergence in sampling.
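The review does not spell out how the SMW (Sherman-Morrison-Woodbury) updates are used. As an assumed illustration of why such updates keep a low-rank covariance cheap to work with, the sketch below solves a linear system whose matrix is a diagonal plus a rank-one term in linear time. The rank-one special case, the function name, and the example values are choices made here, not details taken from the paper.

```python
def solve_diag_plus_rank1(d, u, b):
    """Solve (diag(d) + u u^T) x = b in O(n) via the Sherman-Morrison identity.

    d: positive diagonal entries, u: rank-one factor, b: right-hand side.
    """
    dinv_b = [bi / di for bi, di in zip(b, d)]   # D^{-1} b
    dinv_u = [ui / di for ui, di in zip(u, d)]   # D^{-1} u
    denom = 1.0 + sum(ui * wi for ui, wi in zip(u, dinv_u))    # 1 + u^T D^{-1} u
    coef = sum(ui * yi for ui, yi in zip(u, dinv_b)) / denom   # u^T D^{-1} b / denom
    return [yi - coef * wi for yi, wi in zip(dinv_b, dinv_u)]

# Hypothetical example values.
d = [2.0, 3.0, 4.0]
u = [1.0, 0.5, -1.0]
b = [1.0, 2.0, 3.0]
x = solve_diag_plus_rank1(d, u, b)

# Multiply back to confirm that (diag(d) + u u^T) x reproduces b.
ux = sum(ui * xi for ui, xi in zip(u, x))
residual = [di * xi + ui * ux - bi for di, xi, ui, bi in zip(d, x, u, b)]
print(x, residual)
```

The general Woodbury identity extends the same idea to rank-k factors, replacing the scalar division with the solution of a k-by-k system.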

Weaknesses

* Novelty could be more clearly contrasted with prior MLE-based diffusion ODE works [1,2,3]
* Experiments remain small-scale.

[1] Song, Yang, et al. "Maximum likelihood training of score-based diffusion models." Advances in neural information processing systems 34 (2021): 1415-1428.
[2] Lu, Cheng, et al. "Maximum likelihood training for score-based diffusion odes by high order denoising score matching." International conference on machine learning. PMLR, 2022.
[3] Zheng, Kaiwen, et al. "Impro

Reviewer 02 · Rating 4 · Confidence 3

Strengths

Overall, the paper has some interesting results. It provides non-asymptotic convergence guarantees in total variation for the proposed sampler, characterizing the error in terms of the score and Hessian estimation errors, the dimension d, and the number of diffusion steps T. It theoretically demonstrates the consistency of the proposed quasi-maximum likelihood diffusion training under reverse quasi-likelihood objectives. Multiple simplifications are made that make this approach implementable in practice, albeit it st

Weaknesses

**Main Weaknesses**

*W1* It is unclear why the Hessian is necessary theoretically. The reverse diffusion generates precisely the same distributions as the forward one, and the only unknown term therein is the score. In this sense, the score is a sufficient statistic to go backwards. The Fokker-Planck equation implies the same conclusion, as by formulating the backward probability evolution via an ODE, then the target distribution is perfectly modeled if the score has been perfectly learned. The

Reviewer 03 · Rating 2 · Confidence 5

Strengths

There are some theoretical analyses that make the paper look okay.

Weaknesses

1. The covariance estimation of the Gaussian denoising diffusion model is a solved problem. I am not sure why this paper didn't discuss any related work on this. I will give a brief introduction to this line of research and show **why learning the covariance under quasi-MLE is unnecessary**. All the papers I list below use a Gaussian variational distribution to approximate the denoising distribution under forward KL divergence, which is the same as the quasi-MLE terminology that this paper menti
