Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Pengxiang Li; Joey Tsai; Hongwei Xue; Kunyu Shi; Shilin Yan

arXiv:2603.05454 · cs.CV · March 6, 2026

Open Access · 3 Reviews

TL;DR

The paper introduces the Longest Stable Prefix (LSP) scheduler, a novel inference method for Diffusion Language Models that significantly accelerates text generation by improving cache efficiency and reducing token flip rates.

Contribution

LSP is a training-free, model-agnostic inference paradigm that enhances DLM efficiency by dynamically identifying stable token prefixes and committing them atomically.

Findings

01. LSP accelerates inference by up to 3.4x across various tasks.
02. It reduces token flip rates and denoiser calls.
03. It maintains or improves output quality.

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance', committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment.…
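The scheduling step described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the confidence threshold, the delimiter set, the fallback when no delimiter is found, and all function names are assumptions made for illustration. The per-token confidences are taken as given, standing in for the output of the single forward pass. The sketch contrasts scattered acceptance (any position above the threshold) with a left-aligned stable prefix whose boundary is snapped back to a delimiter.

```python
from typing import List, Sequence

# Hypothetical delimiter set; the paper's actual snapping rules are not given here.
DELIMITERS = {".", ",", ";", ":", "!", "?", "\n", ")", "]", "}"}


def scattered_accept(confidences: Sequence[float], threshold: float) -> List[int]:
    """Baseline: accept every position whose confidence clears the threshold."""
    return [i for i, c in enumerate(confidences) if c >= threshold]


def longest_stable_prefix(confidences: Sequence[float], threshold: float) -> int:
    """Length of the longest left-aligned run with confidence >= threshold."""
    length = 0
    for c in confidences:
        if c < threshold:
            break
        length += 1
    return length


def snap_to_delimiter(tokens: Sequence[str], length: int) -> int:
    """Shrink the prefix so it ends right after its last delimiter token.

    Assumption: if the prefix contains no delimiter, keep the raw length.
    """
    for end in range(length, 0, -1):
        if tokens[end - 1] in DELIMITERS:
            return end
    return length


def lsp_commit(tokens: Sequence[str], confidences: Sequence[float],
               threshold: float = 0.9) -> int:
    """Number of leading tokens to commit as one contiguous block."""
    stable = longest_stable_prefix(confidences, threshold)
    return snap_to_delimiter(tokens, stable)


if __name__ == "__main__":
    tokens = ["The", "cat", "sat", ".", "It", "was", "tired", "."]
    conf = [0.99, 0.97, 0.95, 0.98, 0.93, 0.91, 0.40, 0.96]
    print(scattered_accept(conf, 0.9))       # [0, 1, 2, 3, 4, 5, 7]
    print(longest_stable_prefix(conf, 0.9))  # 6
    print(lsp_commit(tokens, conf, 0.9))     # 4
```

In this toy example, scattered acceptance commits position 7 while position 6 is still unresolved, leaving a gap in the committed region. The prefix rule instead commits only the first four tokens, ending at the sentence boundary, so the committed region stays contiguous from the left.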

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The left-to-right commitment strategy dramatically improves KV cache efficiency.
2. The adaptive sizing mechanism intelligently modulates generation speed based on model confidence, achieving a superior speed-quality balance compared to fixed-size strategies.
3. The method achieves significant inference acceleration without sacrificing, and in some cases even improving, generation quality.
4. Its training-free and model-agnostic nature makes the method highly practical and broadly ge

Weaknesses

1. **Limited Comparative Baselines:** The empirical evaluation primarily compares LSP against "Full decoding," which serves as a quality baseline rather than a competitive speed-oriented one. The paper does not include a direct comparison against other contemporary DLM acceleration techniques, making it difficult to position LSP's performance within the existing state-of-the-art.
2. **Insufficient Hyperparameter Analysis:** The paper lacks a sensitivity analysis for its key hyperparameter, th

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Simple and effective approach that mitigates cache fragmentation without retraining.
- Practical efficiency gains demonstrated across multiple pretrained DLMs.
- Clear experimental reporting and consistent evaluation settings.
- Compatible with existing architectures, requiring minimal modification.

Weaknesses

- The main novelty lies in using left-windowed confidence instead of position-wise confidence, which is conceptually similar to autoregressive commitment heuristics.
- The prefix-first decoding constraint may limit diffusion's flexibility for editing, in-fill, or parallel token generation tasks.
- The geometric decay rule for the active suffix length and its thresholding lack theoretical or empirical grounding.
- GSM8K is a relatively simple benchmark for 7B-scale models; evaluating on AMC or AIME

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Identifies an important problem.
- Proposes a practical and elegant solution, especially because it is training free.
- Demonstrates strong speedup performance and sometimes slight quality gains.
- Thorough ablation studies.

Weaknesses

- The proposed method is less a DLM and more a form of blockwise autoregressive decoding.
- The additional proposals (structural snapping) are not optional solutions; they are critical patches needed to get blockwise decoding to work.
- Structural snapping is domain specific and might not always perform well. How is the performance on CJK?
- The prefix commitment is irreversible, which means one of the most important advantages of DLMs is gone.
- There is no mention of the KV cache update of the committed se

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis