Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde; Fatemeh Doudi; Mahdi Farahbakhsh; Dileep Kalathil; Krishna Narayanan; Jean-Francois Chamberland

arXiv:2603.12554·cs.LG·May 15, 2026

TL;DR

This paper introduces a reinforcement learning approach for diffusion language models that combines entropy-guided step selection with stepwise advantages, and reports state-of-the-art results on coding and reasoning tasks.

Contribution

The paper formulates diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derives an exact, unbiased policy gradient that decomposes over denoising steps, so training does not require explicit evaluation of the sequence likelihood.
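As a rough illustration of what a gradient that "decomposes over denoising steps" can look like, the standard policy gradient theorem for a finite-horizon MDP is written out below. The notation is assumed for illustration and is not taken from the paper: $s_t$ stands for the partially denoised sequence at step $t$, $a_t$ for the denoising action taken there, $T$ for the number of denoising steps, and $A^{\pi_\theta}$ for the advantage under the current policy. The paper's exact statement may differ.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1}
        \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
        A^{\pi_\theta}(s_t, a_t)
    \right]
```

Each term involves only the per-step conditional $\pi_\theta(a_t \mid s_t)$, which is why an expression of this form avoids the marginal likelihood of the final sequence.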

Findings

01. Achieves state-of-the-art results on coding and logical reasoning benchmarks.

02. Outperforms existing RL post-training methods for diffusion language models.

03. Demonstrates strong performance on mathematical reasoning tasks.

Abstract

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate…
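The entropy-guided step selection in item (i) of the abstract can be pictured with a small sketch. This is not the paper's selection rule, which the abstract ties to an approximation bound that is not reproduced here. It assumes, purely for illustration, that each denoising step exposes a per-position token distribution and that the steps with the highest mean token entropy are the ones kept for policy updates. The function names `step_entropies` and `select_steps` are hypothetical.

```python
import numpy as np

def step_entropies(step_probs, eps=1e-12):
    """Mean token-level Shannon entropy (in nats) for each denoising step.

    step_probs: array of shape (T, L, V) holding, for each of T denoising
    steps, a probability distribution over V tokens at each of L positions.
    """
    p = np.asarray(step_probs, dtype=np.float64)
    token_entropy = -(p * np.log(p + eps)).sum(axis=-1)  # shape (T, L)
    return token_entropy.mean(axis=-1)                    # shape (T,)

def select_steps(step_probs, k):
    """Indices of the k highest-entropy steps, returned in ascending order.

    Assumption for illustration only: 'highest mean entropy' is used as the
    selection criterion; the paper's actual criterion may differ.
    """
    h = step_entropies(step_probs)
    top = np.argsort(-h, kind="stable")[:k]
    return np.sort(top)

# Toy trajectory: steps 0 and 2 are fully confident (one-hot),
# steps 1 and 3 are maximally uncertain (uniform over 5 tokens).
one_hot = np.eye(5)[[0, 1, 2]]      # shape (3, 5)
uniform = np.full((3, 5), 1.0 / 5)  # shape (3, 5)
probs = np.stack([one_hot, uniform, one_hot, uniform])
print(select_steps(probs, k=2))     # [1 3]
```

Restricting updates to a few selected steps is what makes such an estimator cheaper than backpropagating through every denoising step; how much accuracy is lost depends on the selection rule, which is the part the paper's bound addresses.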

Code & Models

Repositories

vishnutez/egspo-dllm-rl (GitHub)
