Stabilizing Reinforcement Learning for Diffusion Language Models
Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu

TL;DR
This paper introduces StableDRL, a reinforcement learning method that stabilizes policy optimization in diffusion large language models (dLLMs) by addressing noisy importance-ratio estimates and the gradient instability they cause.
Contribution
The paper proposes StableDRL, a reformulation of GRPO with unconditional clipping and self-normalization, specifically tailored for diffusion large language models to prevent reward collapse.
Findings
StableDRL reduces gradient spikes and instability in diffusion LLMs.
The method improves training stability and policy performance.
The method also extends to block-wise diffusion models, broadening its applicability.
Abstract
Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further…
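To make the clipping and normalization issue concrete, here is a minimal NumPy sketch. The first function is the standard PPO/GRPO-style clipped surrogate with a fixed 1/G group average. The second is only an assumed reading of "unconditional clipping and self-normalization" (clip every ratio regardless of the advantage sign, then divide by the sum of clipped ratios); the paper's exact objective is not given in this listing, and the function names, the eps value, and the toy numbers are all hypothetical.

```python
import numpy as np

def conditional_clip_terms(ratio, adv, eps=0.2):
    """Standard clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    The clip only binds when it lowers the objective, so for A < 0 a
    large ratio passes through unclipped.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv)

def grpo_objective(ratio, adv, eps=0.2):
    """Fixed group-size normalization: plain mean over the G samples."""
    return float(conditional_clip_terms(ratio, adv, eps).mean())

def unconditional_selfnorm_objective(ratio, adv, eps=0.2):
    """ASSUMED variant, not the paper's confirmed formula.

    Every ratio is clipped regardless of the sign of A, and the sum is
    normalized by the total clipped weight instead of by G.
    """
    w = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float((w * adv).sum() / w.sum())

# Toy group of G = 4 samples; the third ratio mimics an estimation-noise spike.
ratio = np.array([1.0, 0.9, 50.0, 1.1])
adv = np.array([0.5, 1.0, -1.0, -0.5])

spike_term = conditional_clip_terms(ratio, adv)[2]    # -50.0, clip bypassed
standard = grpo_objective(ratio, adv)                 # dominated by the spike
stabilized = unconditional_selfnorm_objective(ratio, adv)
```

In this toy group the noisy ratio of 50 paired with a negative advantage contributes -50 to the standard surrogate, because the min selects the unclipped branch. Under the assumed variant the result is a weighted average of the advantages, so its magnitude cannot exceed the largest absolute advantage in the group.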
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Topic Modeling
