STS: Efficient Sparse Attention with Speculative Token Sparsity

Ceyu Xu; Jiangnan Yu; Yongji Wu; Yuan Xie

arXiv:2605.15508·cs.LG·May 19, 2026

TL;DR

STS is a sparse attention method that uses the smaller draft model from speculative decoding to identify important tokens, pruning attention computation in the larger target model without retraining.

Contribution

It presents a sparse attention mechanism that requires no retraining: the draft model's attention scores are repurposed to build a token-and-head-wise sparsity mask that dynamically prunes attention in the target model, improving the accuracy-sparsity trade-off.

Findings

01 Achieves a 2.67x speedup at approximately 90% sparsity on NarrativeQA

02 Incurs negligible accuracy loss compared to dense attention

03 Outperforms prior sparsity techniques on the accuracy-sparsity trade-off
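
One reason a sparsity level near 90% need not translate into a 10x end-to-end speedup is that only the attention portion of the runtime shrinks. The sketch below is a rough Amdahl's-law estimate, not the paper's performance model. The parameter `attn_fraction` (the share of dense runtime spent in attention) is a hypothetical input that the source does not report, and the estimate ignores the cost of running the draft model and building the mask.

```python
def end_to_end_speedup(attn_fraction: float, sparsity: float) -> float:
    """Amdahl-style estimate of overall speedup from sparse attention.

    attn_fraction: assumed share of dense runtime spent in attention (hypothetical).
    sparsity: fraction of attention work that is pruned, e.g. 0.9.
    Assumes attention cost scales linearly with the number of kept tokens
    and that everything outside attention is unchanged.
    """
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    # Runtime after pruning, normalized so the dense runtime equals 1.
    remaining = (1.0 - attn_fraction) + attn_fraction * (1.0 - sparsity)
    return 1.0 / remaining


if __name__ == "__main__":
    # Sweep a few assumed attention shares at 90% sparsity.
    for share in (0.25, 0.5, 0.75, 1.0):
        print(share, round(end_to_end_speedup(share, 0.9), 2))
```

Under this simplified model the 10x ceiling is reached only when attention accounts for the entire runtime; any smaller share gives a lower overall speedup.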

Abstract

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on the representative NarrativeQA benchmark, maintaining…
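
The core idea in the abstract, using draft-model attention scores to build a token-and-head-wise mask that prunes target-model attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the per-head top-k selection rule, the `keep_ratio` parameter, and the assumption that draft scores are already aligned one-to-one with the target model's heads are all hypothetical choices made for illustration; the source does not specify how the mask is constructed or how draft heads map to target heads.

```python
import numpy as np


def build_sparsity_mask(draft_attn: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Keep the top-scoring tokens per head (assumed selection rule).

    draft_attn: (heads, kv_len) draft-model attention scores for the current
                query, assumed to be already mapped onto the target's heads.
    Returns a boolean mask of shape (heads, kv_len); True marks a kept token.
    """
    heads, kv_len = draft_attn.shape
    k = min(kv_len, max(1, int(round(keep_ratio * kv_len))))
    # After argpartition, the last k positions hold the k largest scores per head.
    top_idx = np.argpartition(draft_attn, kv_len - k, axis=-1)[:, kv_len - k:]
    mask = np.zeros((heads, kv_len), dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return mask


def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-query attention over only the kept tokens of each head.

    q: (heads, d); K and V: (heads, kv_len, d); mask: (heads, kv_len).
    """
    heads, d = q.shape
    out = np.empty_like(q)
    for h in range(heads):
        idx = np.flatnonzero(mask[h])        # tokens kept for this head
        scores = K[h, idx] @ q[h] / np.sqrt(d)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        out[h] = weights @ V[h, idx]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heads, kv_len, d = 4, 64, 8
    draft_attn = rng.random((heads, kv_len))  # stand-in for draft scores
    q = rng.standard_normal((heads, d))
    K = rng.standard_normal((heads, kv_len, d))
    V = rng.standard_normal((heads, kv_len, d))
    mask = build_sparsity_mask(draft_attn, keep_ratio=0.1)
    print(mask.sum(axis=1), sparse_attention(q, K, V, mask).shape)
```

Gathering the kept keys and values before the dot products, rather than masking scores after computing them, is what makes the work scale with the number of kept tokens. A production kernel would do this on the GPU in a batched fashion instead of looping over heads in Python.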
