LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Penghui Yang; Cunxiao Du; Fengzhuo Zhang; Haonan Wang; Tianyu Pang; Chao Du; Bo An

arXiv:2502.17421 · cs.CL · April 9, 2026

TL;DR

LongSpec introduces a novel framework for efficient, lossless speculative decoding in long-context language models, addressing memory, performance, and attention challenges to significantly accelerate inference.

Contribution

It proposes a memory-efficient draft model, new position indices, and an attention aggregation strategy to enable fast, accurate long-context decoding.

Findings

01. Achieves up to 3.26x speedup over Flash Attention baselines.

02. Reduces wall-clock time by 2.25x on long reasoning tasks.

03. Demonstrates significant latency improvements in long-context understanding.

Abstract

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that depend heavily on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to the large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when…
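
To make the "lossless" property of speculative decoding concrete, here is a minimal sketch of the generic draft-then-verify loop under greedy decoding. It is not LongSpec's implementation: it omits the paper's draft model design, position indices, and tree attention, and every name in it (`speculative_step`, `generate`, `toy_target`, `toy_draft`) is hypothetical. Models are assumed to be plain functions from a token prefix to the next token, and verification is written as a loop where a real system would score all drafted tokens in one batched target pass.

```python
from typing import Callable, List

# Assumption: a "model" is a function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the longest run the target agrees with, then add one target token."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # Verify phase: compare each proposal with the target's own greedy choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: emit the target's token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target(ctx))  # all k accepted: the target contributes one extra token
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens with speculative steps of draft length k."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def greedy(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Reference: plain token-by-token greedy decoding with the target model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Toy stand-in for a large model: a deterministic function of the prefix.
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Toy stand-in for a draft model: agrees with the target except at every fifth position.
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess
```

Because every emitted token is either a drafted token the target agrees with or the target's own choice, `generate(toy_target, toy_draft, prompt, n)` returns the same sequence as `greedy(toy_target, prompt, n)`. The draft model only changes how many target passes are needed, not the output.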

Code & Models

Datasets

sail/longspec-data
dataset · 124 downloads
