TL;DR
This paper introduces a reinforcement learning method for diffusion language models that combines entropy-guided step selection with stepwise advantage estimates, and reports state-of-the-art results on coding and logical reasoning benchmarks.
Contribution
The paper formulates diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derives an exact, unbiased policy gradient that decomposes over denoising steps, without requiring explicit evaluation of the sequence likelihood.
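For orientation, the general shape of such a stepwise gradient can be written with the standard policy gradient theorem for a finite-horizon MDP. The notation here is an assumption made for illustration and is not taken from the paper: x_T is the fully noised sequence, x_0 the final output, π_θ(x_{t-1} | x_t) the denoising policy, R(x_0) a terminal reward, and the prompt is omitted. The paper's exact expression may differ.

```latex
% Standard finite-horizon policy gradient, written over denoising steps.
% All symbols are illustrative assumptions, not the paper's notation.
\nabla_\theta J(\theta)
  = \mathbb{E}_{x_{T:0} \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t)\, A_t(x_t, x_{t-1})
    \right],
\qquad
A_t(x_t, x_{t-1}) = Q_t(x_t, x_{t-1}) - V_t(x_t).

% With a terminal-only reward, the value terms reduce to conditional expectations:
Q_t(x_t, x_{t-1}) = \mathbb{E}\!\left[ R(x_0) \mid x_{t-1} \right],
\qquad
V_t(x_t) = \mathbb{E}\!\left[ R(x_0) \mid x_t \right].
```

Each term depends only on a single-step transition probability, which is why a gradient of this form never needs the likelihood of the whole sequence.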
Findings
Achieves state-of-the-art results on coding and logical reasoning benchmarks.
Outperforms existing RL post-training methods for diffusion language models.
Demonstrates strong performance on mathematical reasoning tasks.
Abstract
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate…
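The two practical ingredients named in the abstract, entropy-guided step selection and per-step advantages, can be illustrated with a small Python sketch. This is not the paper's estimator: the top-k selection rule, the use of mean token entropy as the score, and every function name below are assumptions made for illustration, since the summary gives neither the approximation bound nor the advantage estimator.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_entropy(step_probs):
    """Mean token entropy over all positions at one denoising step."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)

def select_steps(trajectory_probs, k):
    """Hypothetical rule: keep the k highest-entropy steps, in ascending order."""
    entropies = [step_entropy(s) for s in trajectory_probs]
    ranked = sorted(range(len(entropies)), key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:k])

def stepwise_surrogate(log_probs, advantages, steps):
    """Policy-gradient surrogate loss restricted to the selected steps.

    log_probs[t] is the log-probability of the transition taken at step t,
    advantages[t] is an (assumed given) advantage estimate for that step.
    """
    return -sum(log_probs[t] * advantages[t] for t in steps)

# Toy trajectory: 3 denoising steps, 2 token positions, vocabulary of size 2.
trajectory = [
    [[0.5, 0.5], [0.5, 0.5]],  # uncertain step
    [[0.9, 0.1], [0.9, 0.1]],  # mostly decided
    [[1.0, 0.0], [1.0, 0.0]],  # fully decided
]
chosen = select_steps(trajectory, k=2)
print(chosen)  # -> [0, 1]
print(stepwise_surrogate([-1.0, -2.0, -3.0], [0.5, -0.5, 1.0], chosen))
```

Restricting the update to a few steps trades some gradient signal for compute; ranking by entropy is one simple way to spend that budget on steps where the policy is least certain.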
