STS: Efficient Sparse Attention with Speculative Token Sparsity
Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

TL;DR
STS is a sparse attention method that uses a smaller draft model to identify important tokens, reducing attention computation in a larger target language model without retraining.
Contribution
It presents a sparse attention mechanism that requires no retraining: the draft model's attention scores are reused to dynamically prune attention in the target model, improving the efficiency-accuracy trade-off.
Findings
Achieves a 2.67x speedup at approximately 90% sparsity on NarrativeQA
Incurs negligible accuracy loss compared to dense attention
Outperforms prior sparsity techniques in accuracy-sparsity trade-off
Abstract
The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on the representative NarrativeQA benchmark, maintaining…
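The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-head top-k selection rule, the one-to-one pairing of draft heads with target heads, and the names `topk_mask` and `sparse_attention` are all assumptions made for illustration. The sketch keeps, for each head, only the tokens the draft model attended to most, then runs the target model's attention for one query over that reduced set.

```python
import math

def topk_mask(draft_scores, sparsity):
    """Build a token-and-head-wise mask from draft attention scores.

    draft_scores: one list of per-token attention weights per head.
    sparsity: fraction of tokens to drop per head (e.g. 0.9).
    Returns, per head, the sorted indices of the tokens that are kept.
    ASSUMPTION: top-k per head; the paper may use a different rule.
    """
    kept = []
    for head_scores in draft_scores:
        n = len(head_scores)
        k = max(1, round(n * (1.0 - sparsity)))  # always keep at least one token
        order = sorted(range(n), key=lambda t: head_scores[t], reverse=True)
        kept.append(sorted(order[:k]))
    return kept

def sparse_attention(q, keys, values, kept_idx):
    """Single-head, single-query attention restricted to kept_idx.

    Only the kept tokens contribute logits, so the cost scales with
    len(kept_idx) rather than with the full sequence length.
    """
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[t])) for t in kept_idx]
    m = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, kept_idx)) / z
        for d in range(dim)
    ]

# Hypothetical toy data: 2 heads, 4 tokens, 50% sparsity.
draft_scores = [[0.1, 0.6, 0.2, 0.1], [0.4, 0.05, 0.05, 0.5]]
mask = topk_mask(draft_scores, sparsity=0.5)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
head0_out = sparse_attention(q, keys, values, mask[0])
```

At 90% sparsity each head would score roughly one tenth of the keys. The end-to-end speedup reported above also depends on factors this sketch ignores, such as kernel efficiency and the cost of running the draft model.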
