SpecAttn: Speculating Sparse Attention
Harsh Shah

TL;DR
SpecAttn is a training-free method that makes large language model inference more efficient by reusing the attention weights a draft model already computes during speculative decoding to drive sparse attention in the target model, reducing computation while maintaining output quality.
Contribution
Introduces SpecAttn, a training-free approach that uses draft-model attention weights to enable efficient sparse attention in pre-trained transformers.
Findings
Achieves over 75% reduction in key-value cache accesses.
Increases perplexity by only 15.29% on the PG-19 dataset.
Outperforms existing sparse attention methods.
Abstract
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By…
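The three techniques named in the abstract can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration, not the paper's implementation: the function names (`kl_divergence`, `align_layers`, `top_p_mask`, `prune_kv`) are hypothetical, attention is reduced to one probability vector per layer, the KL direction (target relative to draft) is a guess, and the sorting-free selection uses a threshold binary search as one possible way to avoid a sort. The actual GPU algorithm in the paper may differ.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same token positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def align_layers(draft_attn, target_attn):
    """For each target layer, return the index of the draft layer whose
    attention distribution has the lowest KL divergence from it.
    Inputs are lists of per-layer attention distributions (hypothetical
    calibration data, one probability vector per layer)."""
    mapping = []
    for t in target_attn:
        best = min(range(len(draft_attn)), key=lambda d: kl_divergence(t, draft_attn[d]))
        mapping.append(best)
    return mapping

def top_p_mask(weights, p, iters=30):
    """Top-p selection without sorting: binary-search a threshold so that
    the weights at or above it carry at least mass p.
    Assumes weights are non-negative, sum to about 1, and p < 1."""
    lo, hi = 0.0, max(weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        mass = sum(w for w in weights if w >= mid)
        if mass >= p:
            lo = mid   # still enough mass, so the threshold can rise
        else:
            hi = mid   # too little mass, so the threshold must drop
    return [w >= lo for w in weights]

def prune_kv(keys, values, mask):
    """Keep only the key/value entries whose positions were selected."""
    kept = [i for i, m in enumerate(mask) if m]
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy data: two draft layers, three target layers, three token positions.
draft = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
target = [[0.6, 0.3, 0.1], [0.65, 0.25, 0.1], [0.1, 0.3, 0.6]]
layer_map = align_layers(draft, target)

# Draft attention over five cached tokens, used to prune the target's cache.
draft_weights = [0.5, 0.05, 0.3, 0.1, 0.05]
mask = top_p_mask(draft_weights, p=0.75)
keys = ["k0", "k1", "k2", "k3", "k4"]
values = ["v0", "v1", "v2", "v3", "v4"]
pruned_keys, pruned_values, kept = prune_kv(keys, values, mask)
```

In this toy run the target model would attend over only the kept cache positions rather than all five. The binary search needs only repeated threshold-and-sum passes, which is the kind of operation that parallelizes well, but whether this matches the paper's GPU kernel is not stated in the abstract.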
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Machine Learning in Healthcare
