d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Leyi Pan; Shuchang Tao; Yunpeng Zhai; Zheyu Fu; Liancheng Fang; Minghua He; Lingzhe Zhang; Zhaoyang Liu; Bolin Ding; Aiwei Liu; Lijie Wen

arXiv:2512.09675·cs.CL·May 14, 2026

TL;DR

d-TreeRPO is a new reinforcement learning framework for diffusion language models that improves reliability and reasoning performance by using tree-structured rollouts and confidence-based training techniques.

Contribution

It introduces a tree-structured rollout method and a confidence-guided training loss to enhance policy optimization for diffusion language models.

Findings

01. Achieves +86.2% on Sudoku

02. Achieves +51.6% on Countdown

03. Achieves +4.5% on GSM8K

Abstract

Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) probability estimates that do not account for the gap to the unbiased expectation over all decoding orders, which is intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. Furthermore, we provide a theoretical proof demonstrating that increasing prediction confidence effectively minimizes the gap between unbiased expected…
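As a rough illustration of what "tree-structured rollouts and bottom-up advantage computation" could look like, the sketch below propagates leaf rewards up a small tree and then scores each child against its siblings. It is a toy under stated assumptions, not the paper's method: the abstract gives no formulas, so the Node class, the mean aggregation in backup, and the sibling-mean baseline in assign_advantages are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    """One partial-generation state in a hypothetical rollout tree."""
    name: str
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # set only on leaves (verifiable outcome reward)
    value: float = 0.0              # filled in by the bottom-up pass
    advantage: float = 0.0          # filled in relative to siblings


def backup(node: Node) -> float:
    """Bottom-up pass. A leaf's value is its outcome reward; an internal
    node's value is the mean of its children's values (assumed aggregation)."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no reward")
        node.value = node.reward
    else:
        node.value = mean(backup(child) for child in node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Assumed step-wise advantage: a child's value minus the mean value
    of its sibling group. Call only after backup() has run."""
    if not node.children:
        return
    baseline = mean(child.value for child in node.children)
    for child in node.children:
        child.advantage = child.value - baseline
        assign_advantages(child)


# Toy tree: two branches, each ending in two verified outcomes (1.0 = correct).
a = Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)])
b = Node("b", children=[Node("b1", reward=1.0), Node("b2", reward=0.0)])
root = Node("root", children=[a, b])

backup(root)
assign_advantages(root)

print(root.value)                # 0.75
print(a.advantage, b.advantage)  # 0.25 -0.25
```

In this toy, branch a receives a positive step-level signal because both of its completions verify, even though only outcome rewards at the leaves were observed. That is the sense in which a tree can turn a sparse outcome reward into a finer-grained one.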
