SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Jungyoub Cha, Hyunjong Kim, Sungzoon Cho

TL;DR
SpecExtend enhances speculative decoding for long sequences in large language models by integrating efficient attention mechanisms and a novel retrieval strategy, significantly speeding up inference without retraining.
Contribution
It introduces SpecExtend, a drop-in enhancement that improves long-sequence decoding speed and accuracy using efficient attention and a new retrieval-based context selection method.
Findings
Up to 2.84x acceleration on 16K-token summarization
Up to 3.86x acceleration on long-form reasoning
Preserves short-input performance of existing frameworks
Abstract
Speculative decoding is a widely used technique for accelerating inference in large language models (LLMs), but its performance degrades as input length grows, with significant drops even at moderate lengths. Yet, this early degradation has remained largely underexplored. We introduce SpecExtend, a drop-in enhancement that improves speculative decoding on long sequences without additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that leverages the target model's attention scores to dynamically select relevant context for the smaller draft model. Extensive evaluations show that SpecExtend accelerates speculative decoding by up to…
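As a rough illustration of the Cross-model Retrieval idea described in the abstract, the sketch below picks which cached context positions a draft model would keep, ranked by attention scores taken from the target model. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `select_draft_context` and `evict`, the per-token selection granularity, the fixed `budget`, and the always-kept `recent` window are all hypothetical choices made for this example.

```python
def select_draft_context(attn_scores, budget, recent=0):
    """Return the sorted cache positions the draft model keeps.

    attn_scores: one target-model attention score per cached token
                 (how these are aggregated is an assumption of this sketch).
    budget:      maximum number of positions to keep (hypothetical parameter).
    recent:      number of trailing positions always kept (hypothetical).
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))  # nothing needs to be evicted

    # Always keep the most recent tokens, capped by the budget.
    recent = min(recent, budget)
    keep = set(range(n - recent, n))

    # Rank the older positions by target-model attention, highest first.
    older = sorted(range(n - recent), key=lambda i: attn_scores[i], reverse=True)
    keep.update(older[: budget - recent])

    # Restore original order so the kept tokens stay in sequence.
    return sorted(keep)


def evict(kv_cache, positions):
    """Keep only the selected entries of a (toy) per-token KV cache."""
    return [kv_cache[i] for i in positions]


# Toy example: six cached tokens, keep three, always including the last one.
scores = [0.05, 0.40, 0.01, 0.30, 0.04, 0.20]
kept = select_draft_context(scores, budget=3, recent=1)
draft_cache = evict(["kv0", "kv1", "kv2", "kv3", "kv4", "kv5"], kept)
print(kept, draft_cache)
```

In this toy run the draft model keeps positions 1, 3, and 5: the two highest-scoring older tokens plus the most recent one. A real system would operate on tensors and refresh the selection as decoding proceeds, which this sketch does not attempt to model.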
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Identifies a valuable and underexplored problem: the sharp performance drop of EAGLE-based speculative decoding on long sequences.
2. Presents a rich and comprehensive experimental evaluation across diverse tasks, model variants, and input lengths, with extensive comparisons to multiple strong baselines.
3. The proposed Cross-model Retrieval method is integrated into an overall framework that remains training-free, broadly compatible, and achieves notable speedups without sacrificing short-in

1. The use of efficient attention is mainly an implementation detail rather than a core contribution, which makes the baseline comparisons somewhat unfair.
2. Lacks discussion and empirical analysis on the choice of using the target model’s last-layer attention for draft KV retrieval.

1. The paper successfully identifies and tackles a specific, important, and largely underexplored problem: the early performance drop of speculative decoding in the moderate-length regime. This is a valuable contribution that moves the field beyond focusing solely on the extreme-length memory bottleneck.
2. The core idea of Cross-model Retrieval is novel and intuitive. Using the more powerful target model as an "oracle" to guide the context management of the smaller, less capable draft model is

1. A significant concern is the reliance on the target model's attention scores as an objective measure of context importance. Attention mechanisms are known to exhibit idiosyncratic model-specific behaviors, such as "attention sinks," where high scores are assigned to initial tokens regardless of their semantic relevance. By using these scores to guide the draft model's cache, there is a risk that CMR simply teaches the draft model to replicate the target model's attentional biases rather than

1. This paper presents a practical, training-free augmentation that combines hybrid tree attention with KV-cache eviction to speed up speculative decoding on long inputs.
2. This paper proposes Cross-model Retrieval, leveraging target-model attention to guide KV compression for the draft model, aiming to improve both drafting accuracy and end-to-end latency.

1. The contribution reads as an engineering integration of known components (hybrid attention and KV cache eviction) for long-sequence acceleration, with limited new algorithmic insights beyond composing these pieces.
2. The novelty of Cross-model Retrieval appears limited: similar ideas using target-model attention to prune draft-side redundancy have been explored (e.g., works in EMNLP-2025-SpecVLM that analyze long-context drafting latency and use the verifier’s attention maps to guide pruning).

1. Innovative cache strategy: The Cross-Model Retrieval method effectively improves both speed and accuracy without retraining.
2. Strong empirical results: Demonstrates consistent, significant speedups and robustness across models and tasks.
3. Practicality and generality: Works as a plug-and-play enhancement compatible with existing speculative decoding frameworks.

1. The proposed CMR mechanism feels somewhat lightweight. It mainly relies on reusing attention scores for cache selection, which may limit novelty compared to prior works.
2. The paper uses Hybrid Tree Attention, which appears to originate from LongSpec. It would be helpful to clarify whether this component has been modified or is directly adopted.
3. Experiments are primarily evaluated up to 16K tokens, which might still be short for assessing scalability in modern LLMs. It would strengthen
Taxonomy
Topics: Algorithms and Data Compression
Methods: Softmax · Attention Is All You Need
