TL;DR
DiSPO adds a plug-in credit-assignment layer to masked diffusion language models that directly optimizes intermediate filling decisions, improving performance on math and planning benchmarks.
Contribution
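The paper proposes Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions in masked diffusion language models. It also formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization.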
Findings
DiSPO improves over the baseline on math and planning tasks in experiments on LLaDA-8B-Instruct.
It requires no additional multi-step diffusion rollouts or optimizer steps.
It can be used as a general plug-in for masked diffusion policy optimization.
Abstract
Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently…
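The branching step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `branch_from_state`, `dispo_style_surrogate`, and `reward_fn` are hypothetical, the mean-reward baseline is an assumption (the abstract does not say how advantages are computed), and the cached logits stand in for the current policy's log-probabilities, which a real trainer would recompute with gradients.

```python
import math
import random

MASK = None  # placeholder for a position that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def branch_from_state(state, cached_logits, rng):
    """Fill every masked position in one shot by sampling from cached logits.

    Returns the completed sequence and the indices that were newly filled.
    """
    completion = list(state)
    new_positions = []
    for i, tok in enumerate(state):
        if tok is MASK:
            probs = softmax(cached_logits[i])
            completion[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            new_positions.append(i)
    return completion, new_positions


def dispo_style_surrogate(state, cached_logits, reward_fn, num_branches=4, seed=0):
    """Policy-gradient-style surrogate loss restricted to newly filled tokens.

    Assumption: advantages use a mean-reward baseline over the branches.
    """
    rng = random.Random(seed)
    branches = [branch_from_state(state, cached_logits, rng)
                for _ in range(num_branches)]
    rewards = [reward_fn(completion) for completion, _ in branches]
    baseline = sum(rewards) / len(rewards)

    loss = 0.0
    for (completion, new_positions), reward in zip(branches, rewards):
        advantage = reward - baseline
        # Only the tokens filled during branching contribute to the loss;
        # tokens already present in the intermediate state are skipped.
        log_prob = sum(math.log(softmax(cached_logits[i])[completion[i]])
                       for i in new_positions)
        loss += -advantage * log_prob
    return loss / num_branches, branches, rewards


# Toy usage: 4 positions, vocabulary of 3 tokens, two positions still masked.
state = [2, MASK, MASK, 0]
cached_logits = [[0.1, 0.2, 0.3], [1.0, 0.5, -0.5],
                 [0.0, 0.0, 0.0], [0.3, 0.2, 0.1]]
loss, branches, rewards = dispo_style_surrogate(
    state, cached_logits, reward_fn=lambda seq: float(sum(seq)))
```

Because each branch is a single resampling from logits already cached during the rollout, the sketch involves no extra multi-step denoising, which matches the cost claim in the abstract. The toy reward (sum of token ids) is purely illustrative.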
