Stability-Weighted Decoding for Diffusion Language Models
Yue Wu, Jian Huang

TL;DR
This paper introduces Stability-Weighted Decoding (SWD), a training-free method that improves diffusion language model decoding by factoring each token's temporal stability across denoising steps into its unmasking score, yielding more accurate and robust text generation.
Contribution
The paper provides a theoretical link between token instability and mutual information, and proposes SWD, a universal, plug-and-play decoding strategy that enhances diffusion LLM performance.
Findings
SWD improves accuracy across code and math benchmarks.
SWD maintains performance across different decoding policies.
SWD exhibits robustness under various acceleration ratios.
Abstract
Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary…
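The idea in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The abstract does not give the direction of the KL divergence, the base confidence metric, or the way stability modulates the score. The choices below are therefore assumptions: KL(current || previous), max-probability confidence, an exponential down-weighting, and the hypothetical names `kl_instability`, `stability_weighted_scores`, `select_to_unmask`, and `lam`.

```python
import numpy as np


def kl_instability(p_prev, p_curr, eps=1e-12):
    """Per-position KL(p_curr || p_prev), summed over the vocabulary axis.

    Assumption: the abstract does not say which direction of KL is used.
    """
    p_prev = np.clip(p_prev, eps, 1.0)
    p_curr = np.clip(p_curr, eps, 1.0)
    return np.sum(p_curr * (np.log(p_curr) - np.log(p_prev)), axis=-1)


def stability_weighted_scores(p_prev, p_curr, lam=1.0):
    """Base confidence down-weighted by temporal instability.

    Assumptions: max-probability confidence as the base metric and
    exp(-lam * KL) as the modulator; both are illustrative choices.
    """
    confidence = p_curr.max(axis=-1)
    instability = kl_instability(p_prev, p_curr)
    return confidence * np.exp(-lam * instability)


def select_to_unmask(scores, masked, k):
    """Indices of the k highest-scoring positions that are still masked."""
    scores = np.where(masked, scores, -np.inf)
    order = np.argsort(-scores)
    return order[: min(k, int(masked.sum()))]


# Toy example: 3 masked positions, vocabulary of 3 tokens,
# predictions from two consecutive denoising steps.
p_prev = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.3, 0.4, 0.3]])
p_curr = np.array([[0.1, 0.8, 0.1],   # unchanged prediction: stable
                   [0.1, 0.2, 0.7],   # argmax flipped: unstable
                   [0.3, 0.5, 0.2]])  # small drift
masked = np.array([True, True, True])

scores = stability_weighted_scores(p_prev, p_curr, lam=1.0)
chosen = select_to_unmask(scores, masked, k=2)
print(chosen)
```

In this toy case, confidence alone would rank position 1 second (0.7 versus 0.5 for position 2). Its prediction flipped between steps, so the stability weight demotes it and positions 0 and 2 are unmasked first.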
