TL;DR
This paper introduces a correct batch speculative decoding framework that guarantees output distribution equivalence with standard decoding, overcoming previous synchronization issues and achieving significant throughput improvements.
Contribution
It formalizes synchronization invariants, presents EQSPEC for guaranteed output equivalence, and introduces EXSPEC to reduce overhead via dynamic sequence grouping.
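As an illustration of what dynamic sequence grouping could look like, the sketch below buckets sequences from a pool by their current length, so each batch is already rectangular and needs no realignment. This is a minimal sketch under assumptions, not the paper's EXSPEC implementation: the function name `group_by_length`, the `pool` layout (sequence id mapped to token list), and the `batch_size` parameter are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_length(pool: Dict[int, List[int]], batch_size: int) -> List[List[int]]:
    """Partition sequence ids into batches whose sequences share one length.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Bucket sequence ids by the current number of tokens in each sequence.
    buckets: Dict[int, List[int]] = defaultdict(list)
    for seq_id, tokens in pool.items():
        buckets[len(tokens)].append(seq_id)

    # Split every bucket into chunks of at most batch_size ids.
    batches: List[List[int]] = []
    for length in sorted(buckets):
        ids = buckets[length]
        for start in range(0, len(ids), batch_size):
            batches.append(ids[start:start + batch_size])
    return batches


pool = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7, 8], 3: [9, 10], 4: [11, 12, 13]}
batches = group_by_length(pool, batch_size=2)
```

Sequences that find no same-length partner end up in undersized batches (here, sequence 4 runs alone), which is the kind of scheduling cost the reviewers ask the authors to quantify.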
Findings
Achieves up to 3x throughput improvement at batch size 8.
Maintains 95% output equivalence with standard decoding.
First framework to ensure output distribution correctness in batch speculative decoding.
Abstract
Speculative decoding must produce an output distribution identical to standard autoregressive generation; this output equivalence is not an optimization target but the defining criterion of valid speculative decoding. We demonstrate that all existing batch speculative decoding implementations violate this fundamental requirement, producing corrupted outputs ranging from repetitive tokens to gibberish. These failures stem from the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, desynchronizing position IDs, attention masks, and KV-cache state. We present the first authentic batch speculative decoding framework. We (1) formalize the synchronization invariants that valid batch speculative decoding must satisfy, (2) present EQSPEC, the first algorithm that guarantees output equivalence, and analyze its cost structure to show that alignment overhead…
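The ragged tensor problem can be made concrete with a small sketch in plain Python. It appends a different number of accepted draft tokens to each sequence, then re-pads the batch and rebuilds the attention mask and position IDs so they stay consistent. This is an illustration under assumptions, not the paper's EQSPEC code: the function name `unpad_append_repad`, the choice of left padding, the pad id `0`, and assigning position 0 to pad slots are all hypothetical, and KV-cache realignment is not modeled.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id


def unpad_append_repad(
    sequences: List[List[int]],
    accepted: List[List[int]],
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Append accepted draft tokens per sequence, then left-pad to one length.

    Returns (input_ids, attention_mask, position_ids) as rectangular lists.
    """
    # After appending, rows have different lengths: the batch is ragged.
    extended = [seq + acc for seq, acc in zip(sequences, accepted)]
    max_len = max(len(row) for row in extended)

    input_ids, attention_mask, position_ids = [], [], []
    for row in extended:
        n_pad = max_len - len(row)
        input_ids.append([pad_id] * n_pad + row)
        # Mask is 0 on pad slots and 1 on real tokens.
        attention_mask.append([0] * n_pad + [1] * len(row))
        # Positions count real tokens only; pad slots get 0 and are masked out.
        position_ids.append([0] * n_pad + list(range(len(row))))
    return input_ids, attention_mask, position_ids


# Sequence 0 accepts one draft token, sequence 1 accepts three.
ids, mask, pos = unpad_append_repad(
    sequences=[[5, 6, 7], [8, 9, 10]],
    accepted=[[11], [12, 13, 14]],
)
```

If the mask or position IDs were left at their pre-acceptance shape instead of being recomputed per row, the two sequences would disagree about where their real tokens start, which is the desynchronization the abstract describes.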
Peer Reviews
Decision·Submitted to ICLR 2026
1. **An important topic:** It addresses a critical correctness vs. performance trade-off in the LLM production environment.
2. **Innovation:** The clever use of the scheduling mechanism (cross-batch) to resolve a data structure problem (realignment overhead) is a prime example of system-level optimization.
3. **Significant improvement:** The 3x throughput improvement is achieved while maintaining a high correctness guarantee.
1. **Compatibility issues:** The compatibility with common modern inference techniques like continuous batching and paged attention remains future work.
2. **Lack of fully quantified metrics:** While EXSPEC avoids realignment, cross-batch scheduling itself might introduce new scheduling latency. The paper needs to further discuss and quantify EXSPEC's scheduling overhead under realistic high-concurrency workloads.
- The paper provides a detailed analysis of the existing problems in batch speculative decoding and proposes direct solutions to the most critical issues. For example, the EQSPEC design introduces the *unpad–repad* strategy to ensure correctness, while EXSPEC employs a *dynamic scheduling mechanism* to improve efficiency.
- The paper quantitatively analyzes the actual cost composition and speedup factors in batch speculative decoding, offering an in-depth breakdown of various cost sources and th
- The models used for validation in this paper, such as Vicuna and GLM, are relatively outdated and small in scale. Since speculative decoding provides limited acceleration benefits for smaller models, the effectiveness and impact of the proposed methods may be somewhat diminished.
- The paper does not introduce substantial optimizations for KV cache management. Its realignment process is implemented by re-concatenating a rank-4 KV tensor, which imposes significant memory overhead. In contrast,
The topic is highly relevant to practical LLM usage. Batch speculative decoding is an important direction, yet few existing works focus on correctness; most focus on speed. The group-then-padding algorithm is practically useful. It can mitigate the length misalignment in an efficient way. Specifically, as stated in Section 2, it does not modify position IDs, avoids re-implementing a whole new kernel, and also preserves the accepted tokens from being crop
Main concerns: As the paper states, the problem with current inference systems is incorrect output, which is allegedly caused by KV-cache and position-ID errors. I think the root causes should be specified more precisely and quantified. Is it because current systems have not implemented batch SD support, because the implementation is incorrect, or simply because floating-point precision is insufficient? Specifically, vLLM can achieve high match accuracy on Vicuna, but lower on other models. If the cause is about mis
