TL;DR
DPRM is a plug-in token-ordering module for diffusion language models. It controls which tokens are revealed, retained, revised or verified at each step, and is reported to improve results on reasoning and bioinformatics tasks.
Contribution
It proposes a plug-in token-ordering method based on the Doob h-transform that leaves the host model architecture unchanged and optimizes only the token-reveal policy.
Findings
DPRM outperforms confidence-based baselines in pretraining and test-time scaling.
It achieves significant gains on harder reasoning subsets.
In bioinformatics applications, DPRM improves structural and fragment-constrained metrics.
Abstract
Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates a train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform Process Reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N)…
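To make the idea of a reward-tilted Gibbs reveal law concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes that each masked position has a confidence score (used as a base log-weight) and an estimated reward, and that the two are combined as a softmax over `confidence + mix * reward / temperature`. The function names, the linear `mix_schedule`, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import math
import random


def mix_schedule(step, total_steps):
    """Hypothetical linear ramp from 0 (confidence only) to 1 (full reward tilt)."""
    return min(1.0, step / total_steps)


def reveal_distribution(confidence, reward, mix, temperature=1.0):
    """Softmax (Gibbs) distribution over masked positions.

    confidence: dict position -> base log-weight (assumed to come from the model)
    reward:     dict position -> estimated reward (assumed to come from online estimates)
    mix:        0.0 reproduces confidence-only ordering; 1.0 applies the full tilt
    """
    logits = {i: confidence[i] + mix * reward[i] / temperature for i in confidence}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {i: math.exp(v - top) for i, v in logits.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_reveal(confidence, reward, mix, rng, temperature=1.0):
    """Sample one masked position to reveal from the tilted distribution."""
    probs = reveal_distribution(confidence, reward, mix, temperature)
    positions = list(probs)
    return rng.choices(positions, weights=[probs[i] for i in positions], k=1)[0]


if __name__ == "__main__":
    conf = {0: 0.0, 1: 1.0, 2: 0.5}    # illustrative values only
    rew = {0: 3.0, 1: 0.0, 2: 0.0}     # illustrative values only
    rng = random.Random(0)
    for step in range(5):
        mix = mix_schedule(step, total_steps=4)
        print(step, round(mix, 2), sample_reveal(conf, rew, mix, rng))
```

With `mix = 0` the distribution depends only on confidence, so the most confident position is the most likely to be revealed. As `mix` grows, positions with a high estimated reward gain probability even when their confidence is low, which is the kind of exploration that a confidence-only rule would suppress.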
