dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Wenxuan Zhang; Lemeng Wu; Changsheng Zhao; Ernie Chang; Mingchen Zhuge; Zechun Liu; Andy Su; Hanxian Huang; Jun Chen; Chong Zhou; Raghuraman Krishnamoorthi; Vikas Chandra; Mohamed Elhoseiny; Wei Wen

arXiv:2603.18806 · cs.AI · April 14, 2026

TL;DR

dTRPO is a trajectory reduction method for policy optimization of diffusion large language models (dLLMs). It lowers the cost of computing trajectory probabilities during training and improves performance on STEM, coding, and instruction-following benchmarks.

Contribution

The paper proposes a trajectory reduction strategy integrated into policy optimization, enabling scaled-up offline training and improved performance of diffusion LLMs.

Findings

01 Up to 9.6% improvement on STEM tasks

02 Up to 4.3% improvement on coding tasks

03 Up to 3.0% improvement on instruction-following tasks

Abstract

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks.…
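To make claims (i) and (ii) more concrete, here is a minimal sketch of how a log probability ratio could be estimated from a single forward pass over a re-masked final state. It is an illustration under stated assumptions, not the paper's implementation: the `MASK` id, the uniform random re-masking, the toy `policy_fn` and `ref_fn` models, and all function names are hypothetical, and the abstract does not specify how positions are weighted or how the estimate enters the final objective.

```python
import math
import random

MASK = -1  # hypothetical mask-token id, chosen for this toy example


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def remask(tokens, ratio, rng):
    """Mask a random subset of the final sequence.

    Returns the masked copy and the sorted list of masked positions.
    Uniform random masking is an assumption of this sketch.
    """
    k = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


def masked_log_ratio(policy_fn, ref_fn, tokens, ratio, rng):
    """Sum of log pi(x_i | masked) - log pi_ref(x_i | masked) over re-masked positions.

    Each model is called once on the re-masked sequence, i.e. one forward
    pass per model, and only the masked positions contribute.
    """
    masked, positions = remask(tokens, ratio, rng)
    pol = policy_fn(masked)  # per-position logit vectors from the policy
    ref = ref_fn(masked)     # per-position logit vectors from the reference
    total = 0.0
    for p in positions:
        tok = tokens[p]
        total += log_softmax(pol[p])[tok] - log_softmax(ref[p])[tok]
    return total


VOCAB = 4  # toy vocabulary size


def ref_fn(masked):
    """Toy reference model: uniform logits at every position."""
    return [[0.0] * VOCAB for _ in masked]


def policy_fn(masked):
    """Toy policy: prefers token 2 at every position."""
    return [[0.0, 0.0, 1.0, 0.0] for _ in masked]


tokens = [2, 2, 2, 2]
same = masked_log_ratio(ref_fn, ref_fn, tokens, 0.5, random.Random(0))
shifted = masked_log_ratio(policy_fn, ref_fn, tokens, 0.5, random.Random(0))
```

In this toy setup, `same` is zero because the policy and reference agree, while `shifted` is positive because the policy places more mass than the reference on the tokens that were re-masked. A real dLLM would replace the two toy functions with model forward passes that return logits for the masked sequence.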
