Energy-Based Diffusion Language Models for Text Generation

Minkai Xu; Tomas Geffner; Karsten Kreis; Weili Nie; Yilun Xu; Jure Leskovec; Stefano Ermon; Arash Vahdat

arXiv:2410.21357·cs.CL·March 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Energy-based Diffusion Language Models (EDLM), which improve diffusion-based text generation by using an energy-based approach at the sequence level, achieving better performance and faster sampling compared to existing diffusion models.

Contribution

The paper proposes EDLM, an energy-based diffusion model for text generation that enhances approximation quality and sampling efficiency, bridging the gap with autoregressive models.

Findings

1. EDLM outperforms state-of-the-art diffusion models on benchmarks.
2. EDLM approaches autoregressive models' perplexity.
3. EDLM offers a 1.3× speedup in sampling without performance loss.

Abstract

Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with the capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform the autoregressive counterparts, with the performance gap increasing when reducing the number of sampling steps. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by…
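The residual form mentioned in the abstract can be illustrated with a small sketch. It assumes the corrected denoising distribution is proportional to the diffusion model's factorized proposal times `exp(-energy)`, and that samples are drawn by self-normalized importance resampling over a handful of proposals. The names `residual_resample` and `toy_energy` are hypothetical, the candidate sequences are made up, and the toy energy ignores the noisy input and timestep that a real sequence-level energy would condition on. This is an illustration under those assumptions, not the authors' implementation.

```python
import math
import random


def residual_resample(candidates, energy, rng=random):
    """Pick one candidate with probability proportional to exp(-energy(c)).

    `candidates` are assumed to be i.i.d. draws from the diffusion model's
    factorized denoiser, so the proposal probability cancels in the
    importance weights and only the residual energy term remains.
    """
    neg_energies = [-energy(c) for c in candidates]
    shift = max(neg_energies)  # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in neg_energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return candidates[index], probs


def toy_energy(seq):
    """Hypothetical energy: one unit of penalty per repeated adjacent token."""
    return float(sum(a == b for a, b in zip(seq, seq[1:])))


# Hypothetical proposals standing in for samples from the diffusion denoiser.
candidates = [
    ["the", "the", "cat"],
    ["the", "cat", "sat"],
    ["cat", "cat", "cat"],
]
sample, probs = residual_resample(candidates, toy_energy, random.Random(0))
```

In this toy setup the candidate with no repeated tokens receives the largest resampling probability, which is the intended effect of a sequence-level correction: proposals drawn independently per token are reweighted by a score that looks at the whole sequence.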

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- The paper is clearly written and easy to follow.
- The method is intuitive and based on a solid mathematical foundation.
- The experiments follow the standard setup for evaluating Diffusion Language Models, making it easy to compare with other methods.

Weaknesses

- This method requires a pre-trained discrete diffusion model, which increases the overall computational requirements. Thus, it may be unfair to compare it directly with simpler methods like MDLM.
- While the proposed method reduces the Gen PPL metric, it also decreases the entropy of generated texts. One could even argue that it produces similar results to MDLM in terms of Gen PPL and entropy.

**Recommended Experiments**: It would be interesting to see a more detailed trade-off between entrop…

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- This paper is well-written and well-organized, presenting a simple yet elegant combination of energy-based and discrete diffusion models. EDLM demonstrates strong empirical results that match auto-regressive models with great improvements in sampling speed.
- The adaptation of auto-regressive models to a joint denoising distribution with masked inputs is innovative.
- The application and detailed analysis of importance sampling windows effectively improve the early sampling phases in discrete…

Weaknesses

- While the study aims to address the independence assumptions in discrete diffusion models through EBMs, there is insufficient examination of relevant prior research also involving EBMs for language modeling (e.g., [1]). Given the similarity between EDLMs and [1], a more detailed discussion and comparison would clarify the position and relevance of this study.
- Another concern lies in the significance of applying EBMs to discrete diffusion models. Although vanilla discrete diffusion processes…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Using an AR model to help sampling from a masked discrete diffusion language model is natural and straightforward.
2. The experiments demonstrate good improvements over the MDLM baseline.

Weaknesses

**My main concern is that the core technique introduced in this work has been present in the literature for a long time.**

1. First, the framework of the absorbing discrete diffusion model is essentially the same as the BERT-like masked language model (MLM). The forward process corresponds to masking tokens in the input, while the backward process corresponds to predicting and remasking tokens during iterative generation in MLMs. Therefore, this work primarily explores how to apply an energy-ba…

Taxonomy

Topics: Topic Modeling

Methods: energy-based model · Diffusion