Relative Score Policy Optimization for Diffusion Language Models
Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou

TL;DR
This paper introduces RSPO, a novel reinforcement learning method for diffusion language models that improves reasoning by calibrating noisy likelihood estimates using verifiable rewards.
Contribution
RSPO is a simple, effective policy optimization algorithm that leverages relative log-ratio calibration to enhance reasoning capabilities in diffusion language models.
Findings
RSPO achieves strong gains on planning tasks.
RSPO shows competitive performance in mathematical reasoning.
The method stabilizes RL training by reducing variance in score estimates.
Abstract
Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key…
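The failure mode described in the abstract, where a large reward amplifies an inaccurate log-ratio estimate, can be illustrated with a toy sketch. This is not RSPO and not the authors' code. It assumes the estimation error is zero-mean Gaussian noise added to a true log-ratio of 0, and the names `noisy_log_ratio` and `surrogate_spread` are hypothetical. It only shows that the spread of a reward-weighted ratio term grows in proportion to the reward weight.

```python
import math
import random
import statistics


def noisy_log_ratio(true_log_ratio, noise_std, rng):
    """Stand-in for a noisy likelihood-ratio estimate (assumption: Gaussian error)."""
    return true_log_ratio + rng.gauss(0.0, noise_std)


def surrogate_spread(advantage, noise_std, n_samples=10_000, seed=0):
    """Standard deviation of advantage * exp(estimated log-ratio) over many draws."""
    rng = random.Random(seed)  # fixed seed so both calls see the same noise
    terms = [
        advantage * math.exp(noisy_log_ratio(0.0, noise_std, rng))
        for _ in range(n_samples)
    ]
    return statistics.stdev(terms)


# Same noise level, different reward weights (advantages).
low = surrogate_spread(advantage=1.0, noise_std=0.5)
high = surrogate_spread(advantage=5.0, noise_std=0.5)
print(f"spread at advantage 1: {low:.3f}")
print(f"spread at advantage 5: {high:.3f}")
```

Because both calls reuse the same seed, the second spread is five times the first: the estimation noise is unchanged, but the larger weight scales its effect on the training signal.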
