Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Jingyang Ou; Jiaqi Han; Minkai Xu; Shaoxuan Xu; Jianwen Xie; Stefano Ermon; Yi Wu; Chongxuan Li

arXiv:2512.03759·cs.CL·December 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ESPO, a sequence-level RL framework for diffusion LLMs that treats entire sequences as actions, overcoming likelihood approximation challenges and significantly improving performance on reasoning and coding tasks.

Contribution

The paper proposes a novel ELBO-based sequence-level policy optimization method for diffusion LLMs, addressing fundamental likelihood approximation issues in RL.

Findings

01. ESPO outperforms token-level baselines by 20-40 points on the Countdown task.

02. ESPO achieves consistent improvements on math and coding benchmarks.

03. The paper establishes sequence-level optimization as a principled RL paradigm for dLLMs.

Abstract

Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training.…
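The sequence-level idea in the abstract can be illustrated with a minimal sketch. It assumes that "per-token normalization" means dividing the ELBO difference by the response length before exponentiating (as in GSPO-style length normalization), and that advantages are standardized within a group of responses to the same prompt. The function names, the clipping range `eps`, and the omission of the KL term are illustrative choices, not details taken from the paper.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def sequence_ratio(elbo_new, elbo_old, length):
    """Sequence-level importance ratio built from ELBO estimates.

    Assumption: the ELBO difference is divided by the response length
    (per-token normalization) so long sequences do not yield extreme ratios.
    """
    return math.exp((elbo_new - elbo_old) / length)


def clipped_objective(elbos_new, elbos_old, lengths, rewards, eps=0.2):
    """PPO-style clipped surrogate with one term per whole sequence.

    The KL regularizer mentioned in the abstract is omitted here.
    """
    advantages = group_advantages(rewards)
    terms = []
    for e_new, e_old, n, adv in zip(elbos_new, elbos_old, lengths, advantages):
        rho = sequence_ratio(e_new, e_old, n)
        rho_clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        # Pessimistic bound: take the smaller of the clipped and unclipped terms.
        terms.append(min(rho * adv, rho_clipped * adv))
    return mean(terms)
```

Each response contributes a single ratio, so the objective never needs a per-token conditional probability, which is the quantity the abstract says dLLMs lack.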

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* The paper explains fairly well the problem setup and why token-level importance ratios lack a valid probabilistic interpretation for dLLMs. The shortcomings of existing methods also make the motivation quite clear.
* I like the idea of moving to a sequence-level objective and using an ELBO-based ratio, avoiding heuristic token surrogates.
* The performance of the method is tested across a variety of tasks and two different base models, and it shows consistent improvement over the baselines.

Weaknesses

* ESPO optimizes an ELBO difference, not the true sequence likelihood ratio, but the paper does not quantify how ELBO tightness affects policy improvement or bias across tasks.
* The paper misses some closely related prior work on RL fine-tuning of diffusion language models [1, 2]. I believe a comparison to these baselines would be critical.

[1] Venkatraman et al., 2024. Amortizing intractable inference in diffusion models for vision, language, and control.
[2] Zekri and Boullé, 2025. Fine-Tuni
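The first weakness above can be stated precisely with the standard ELBO identity. The notation is an assumption made for illustration and is not taken from the paper: $\mathcal{L}_\theta(y \mid x)$ denotes the ELBO and $g_\theta(y \mid x)$ its gap to the true log-likelihood.

```latex
\log p_\theta(y \mid x) = \mathcal{L}_\theta(y \mid x) + g_\theta(y \mid x),
\qquad g_\theta(y \mid x) \ge 0

\mathcal{L}_\theta(y \mid x) - \mathcal{L}_{\theta_{\text{old}}}(y \mid x)
  = \log \frac{p_\theta(y \mid x)}{p_{\theta_{\text{old}}}(y \mid x)}
  - \bigl( g_\theta(y \mid x) - g_{\theta_{\text{old}}}(y \mid x) \bigr)
```

The ELBO difference therefore matches the true log-ratio only when the two gaps coincide. Because each gap is merely non-negative, the sign of the resulting bias is not fixed in general.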

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Using a sequence-level action space makes the method very simple.
- Experiments show strong improvements over the baselines, which do not treat the entire sequence as an action.

Weaknesses

- In terms of novelty, this seems like a straightforward application of GSPO to diffusion models.
- It is not clear why per-token evaluation is necessarily bad for diffusion LLMs specifically. Is it true for all LLMs (as GSPO claims) or just for diffusion LLMs?

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is well-written with a logical structure that makes the technical content accessible. The progression from problem formulation to methodology to experimental validation is easy to follow.
- The experimental evaluation demonstrates notable improvements on the Countdown, math, and coding tasks, suggesting the proposed approach is effective for the target applications.

Weaknesses

- The proposed method largely combines existing techniques from prior work (Zheng et al., 2025; Tang & Munos, 2025b) without introducing new algorithmic components or theoretical insights. The contribution appears primarily incremental, adapting established methods to the diffusion language model setting rather than developing new approaches tailored to the unique characteristics of these models.
- While the empirical improvements are encouraging, the paper does not address the fundamental ques

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics