Relative Score Policy Optimization for Diffusion Language Models

Zichao Yu; Shengze Xu; Bingqing Jiang; Wenyi Zhang; Difan Zou

arXiv:2605.10218·cs.CL·May 12, 2026

TL;DR

This paper introduces RSPO, a novel reinforcement learning method for diffusion language models that improves reasoning by calibrating noisy likelihood estimates using verifiable rewards.

Contribution

RSPO is a simple, effective policy optimization algorithm that leverages relative log-ratio calibration to enhance reasoning capabilities in diffusion language models.

Findings

01

RSPO achieves strong gains on planning tasks.

02

RSPO shows competitive performance in mathematical reasoning.

03

The method stabilizes RL training by reducing variance in score estimates.

Abstract

Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key…
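The abstract's point that high verifier rewards can amplify inaccurate score estimates can be illustrated with a small sketch. This is not RSPO itself, whose details are cut off above. It is a generic, unclipped policy-gradient surrogate with group-normalized advantages, chosen here only for illustration. The function names, the four-sample group, and the fixed 0.3 estimation error are all assumptions.

```python
import math
import statistics


def group_relative_advantages(rewards):
    """Center and scale verifier rewards within one group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def surrogate_terms(advantages, log_ratios):
    """Unclipped per-sample surrogate: advantage * exp(sequence-level log-ratio)."""
    return [a * math.exp(lr) for a, lr in zip(advantages, log_ratios)]


# One group of four completions; only the first passes the verifier.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Assumption: the true log-ratio is 0 for every sample, and the estimator
# overshoots each one by the same amount (0.3 nats).
true_terms = surrogate_terms(advantages, [0.0] * 4)
noisy_terms = surrogate_terms(advantages, [0.3] * 4)

# Absolute error introduced by the estimation noise, per sample.
errors = [abs(n - t) for n, t in zip(noisy_terms, true_terms)]
for adv, err in zip(advantages, errors):
    print(f"advantage={adv:+.3f}  surrogate error={err:.3f}")
```

Under these assumptions the same estimation error costs three times as much on the rewarded sample as on each unrewarded one, because the error in each surrogate term scales with the magnitude of its advantage.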
