SpecAttn: Speculating Sparse Attention

Harsh Shah

arXiv:2510.27641 · cs.CL · November 3, 2025

TL;DR

SpecAttn is a training-free method that makes large language model inference more efficient by reusing the attention weights the draft model already computes during speculative decoding to drive sparse attention in the target model, reducing computation while maintaining output quality.

Contribution

It introduces SpecAttn, a novel approach that exploits draft model attention weights for efficient sparse attention in pre-trained transformers without additional training.

Findings

01

Achieves over 75% reduction in key-value cache accesses.

02

Increases perplexity by only 15.29% on the PG-19 dataset.

03

Outperforms existing sparse attention methods.

Abstract

Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By…
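
The three techniques named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`align_layers`, `top_p_token_mask`, `prune_kv`), the KL direction (target relative to draft), the per-layer averaged attention vectors, and the threshold values are all assumptions made for illustration. In particular, the selection step below uses an ordinary sort for clarity, whereas the abstract describes a GPU-optimized sorting-free algorithm.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two attention distributions over the same tokens."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def align_layers(draft_attn, target_attn):
    """Map each target layer to the draft layer with the lowest KL divergence.

    draft_attn:  (n_draft_layers, n_tokens) attention distributions
    target_attn: (n_target_layers, n_tokens) attention distributions
    Assumption: one averaged distribution per layer, KL(target || draft).
    """
    mapping = []
    for t in target_attn:
        kls = [kl_divergence(t, d) for d in draft_attn]
        mapping.append(int(np.argmin(kls)))
    return mapping

def top_p_token_mask(attn, p=0.9):
    """Boolean mask of the smallest token set whose attention mass reaches p.

    Sort-based for readability; the paper's version is described as sorting-free.
    """
    attn = np.asarray(attn, dtype=float)
    order = np.argsort(attn)[::-1]            # token indices, highest weight first
    cum = np.cumsum(attn[order])
    k = min(int(np.searchsorted(cum, p)) + 1, attn.size)
    mask = np.zeros(attn.size, dtype=bool)
    mask[order[:k]] = True
    return mask

def prune_kv(keys, values, mask):
    """Keep only the key/value rows for the selected tokens."""
    return keys[mask], values[mask]

# Hypothetical usage: draft attention over 5 cached tokens guides target pruning.
draft_weights = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
mask = top_p_token_mask(draft_weights, p=0.75)
keys, values = np.random.rand(5, 4), np.random.rand(5, 4)
kept_keys, kept_values = prune_kv(keys, values, mask)
```

In this sketch the target model would attend only over `kept_keys` and `kept_values`, so the fraction of the cache it reads is governed by how concentrated the draft attention is and by the chosen `p`.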

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Machine Learning in Healthcare