TL;DR
SureLock is a method that detects when tokens in masked diffusion language models have stabilized, allowing the model to skip unnecessary recomputations and significantly reduce computational costs while maintaining quality.
Contribution
The paper introduces SureLock, a novel technique that locks stabilized tokens during diffusion decoding, reducing computation without sacrificing output quality.
Findings
SureLock reduces FLOPs by 30-50% on LLaDA-8B.
It maintains comparable generation quality to standard methods.
Theoretical analysis justifies monitoring the local KL divergence when making locking decisions.
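The locking check can be pictured with a minimal sketch. It assumes the stability signal is the KL divergence between a position's posterior at two consecutive steps, compared against a fixed threshold. The names `kl_divergence` and `update_locks`, the direction of the KL, and the threshold value are all hypothetical, chosen for illustration; they are not taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def update_locks(prev_probs, curr_probs, unmasked, locked, threshold=1e-3):
    """Return the locked set after one step.

    prev_probs / curr_probs: dict mapping position -> posterior over the vocabulary.
    unmasked: positions that already hold a decoded token.
    locked: positions locked at earlier steps (locks are never released here).
    threshold: hypothetical stability cutoff, not a value from the paper.
    """
    new_locked = set(locked)
    for pos in unmasked:
        if pos in new_locked:
            continue
        # Assumed criterion: the posterior barely moved between steps.
        if kl_divergence(curr_probs[pos], prev_probs[pos]) < threshold:
            new_locked.add(pos)
    return new_locked

# Toy example with a two-word vocabulary.
prev = {0: [0.9, 0.1], 1: [0.5, 0.5]}
curr = {0: [0.9, 0.1], 1: [0.8, 0.2]}
result = update_locks(prev, curr, unmasked={0, 1}, locked=set())
```

In this toy run, position 0 has an unchanged posterior and is locked, while position 1 is still moving and stays active.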
Abstract
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2 d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial…
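The cost statement above can be made concrete with a rough sketch. It assumes the dominant per-step cost is proportional to (number of active query positions) × (sequence length) × (model dimension), and it ignores constant factors and all other terms. The function names and the unlocked-position schedule are hypothetical.

```python
def attention_cost(n_tokens, d_model, n_unlocked=None):
    """Proportional attention cost for one step.

    With no locking, every position issues a query: n_tokens * n_tokens * d_model.
    With locking, only n_unlocked positions issue queries, but they still
    attend to all n_tokens cached keys/values: n_unlocked * n_tokens * d_model.
    """
    m = n_tokens if n_unlocked is None else n_unlocked
    return m * n_tokens * d_model

def relative_cost(n_tokens, d_model, unlocked_per_step):
    """Locked cost divided by unlocked cost over a whole decoding run."""
    full = len(unlocked_per_step) * attention_cost(n_tokens, d_model)
    locked = sum(attention_cost(n_tokens, d_model, m) for m in unlocked_per_step)
    return locked / full

# Hypothetical schedule: 100 positions, half of them locked after the first step.
ratio = relative_cost(n_tokens=100, d_model=64, unlocked_per_step=[100, 50, 50])
```

Under this simplified model the ratio reduces to the average fraction of unlocked positions across steps, so the savings grow as more positions lock earlier in decoding.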