A Diffusive Classification Loss for Learning Energy-based Generative Models

RuiKang OuYang; Louis Grenioux; José Miguel Hernández-Lobato

arXiv:2601.21025·stat.ML·May 22, 2026

TL;DR

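This paper introduces Diffusive Classification (DiffCLF), a training objective for time-dependent energy-based models that avoids the mode blindness of score matching while remaining computationally efficient. The resulting models can be used for generation, compositional sampling, and Boltzmann sampling.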

Contribution

The paper proposes DiffCLF, a training method that casts EBM learning as supervised classification, avoiding the mode blindness of score matching and the nested sampling required by direct maximum likelihood.

Findings

01

DiffCLF accurately estimates energies in Gaussian mixtures.

02

Models trained with DiffCLF perform well in composition and Boltzmann sampling tasks.

03

DiffCLF outperforms existing EBM training methods in efficiency and quality.

Abstract

Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem…
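To make the abstract's two technical claims concrete (the score is the negative input-gradient of the energy, and score matching is blind to mode weights), here is a minimal NumPy sketch. It is an illustration only and is not the DiffCLF objective or the authors' code: the function names `gmm_energy` and `score`, the two-mode 1-D mixture, and the finite-difference gradient are all assumptions chosen for the example. It compares two mixtures that differ only in their mode weights.

```python
import numpy as np

def gmm_energy(x, weights, means, std=1.0):
    """Energy E(x) = -log p(x) of a 1-D Gaussian mixture (hypothetical toy target)."""
    x = np.asarray(x, dtype=float)[..., None]            # shape (n, 1)
    log_comp = (np.log(weights)
                - 0.5 * ((x - means) / std) ** 2
                - 0.5 * np.log(2 * np.pi * std ** 2))    # shape (n, k)
    # Numerically stable log-sum-exp over the mixture components.
    m = log_comp.max(axis=-1, keepdims=True)
    log_p = m[..., 0] + np.log(np.exp(log_comp - m).sum(axis=-1))
    return -log_p

def score(x, weights, means, std=1.0, eps=1e-5):
    """Score = -dE/dx, approximated here with a central finite difference."""
    x = np.asarray(x, dtype=float)
    e_plus = gmm_energy(x + eps, weights, means, std)
    e_minus = gmm_energy(x - eps, weights, means, std)
    return -(e_plus - e_minus) / (2 * eps)

# Two well-separated modes; the mixtures differ only in the mode weights.
means = np.array([-10.0, 10.0])
w_a = np.array([0.5, 0.5])
w_b = np.array([0.9, 0.1])
xs = np.array([-11.0, -10.0, -9.0, 9.0, 10.0, 11.0])     # points near each mode

# Near the modes the scores agree almost exactly, while the energies differ
# by roughly the log-ratio of the weights.
score_gap = np.max(np.abs(score(xs, w_a, means) - score(xs, w_b, means)))
energy_gap = np.max(np.abs(gmm_energy(xs, w_a, means) - gmm_energy(xs, w_b, means)))
print(f"max score gap:  {score_gap:.2e}")
print(f"max energy gap: {energy_gap:.3f}")
```

In this toy setting the score near each mode carries essentially no information about the weights, whereas the energy shifts by about the log of the weight ratio. This is one way to read the abstract's motivation for learning energies rather than scores alone.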

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The proposed loss function for training time-dependent energy functions seems to be novel and interesting.

Weaknesses

Overall, the paper is rather poorly written. While the manuscript spends several pages introducing the background, the key part of the proposed approach (Section 3) is rather brief and needs further elaboration. In its current form, the objective function is not clearly explained, in particular why such an objective is used. Moreover, the statement of the theoretical result is not precise and seems to be missing assumptions. For example, the authors claimed that the score-matching methods suffer from t

Reviewer 02 · Rating 2 · Confidence 3

Strengths

* Addresses a real limitation of current diffusion models (mode blindness) by attempting to model energies, not only scores.
* The idea of connecting diffusion training with a classification/NCE-type loss is novel and potentially useful.
* Theoretical motivation and experiments are at least qualitatively consistent with the intended effect.

Weaknesses

* The paper is very difficult to follow. Section 2 in particular is chaotic: notation such as $X_t$, $Y_t$, $p_t$, $q_t$, $S(t)$, $\sigma(t)$ is introduced with little intuition or connection, and the relationship between the data distribution and the time-evolving process is unclear.
* The stated objective “to estimate the densities $(p_t)_t$” is conceptually confusing: the goal should be to model the data distribution $p_0$, not all intermediate marginals.
* Equations (6)–(7) appear without suf

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed method makes it possible to learn the energy function and the normalizing constant simultaneously, which is known to be a challenging task.
- The authors provide a rich literature overview and connect their work to many related works in generative modelling, in particular score-based models.
- The proposed approach provides consistently better classification performance than the alternatives.

Weaknesses

- The proposed method provides an interesting framework to train the energy function and the log-normalizing constant simultaneously. However, the alternatives considered in the paper, such as score-based training, have been widely studied these past few years. Wasserstein and KL upper bounds have been proposed in the strongly log-concave case and under weaker assumptions on the data distribution. As the method highlights better performance in classification loss, it would be very interesting to obt

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference · Machine Learning in Healthcare