How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
Yichuan Deng, Zhao Song, Jing Xiong, Chiwun Yang

TL;DR
This paper provides a theoretical analysis of sparse attention, showing that standard attention is inherently $n^C$-sparse and establishing the conditions under which sparse attention approximates exact attention, which can guide more efficient model design.
Contribution
It introduces a theoretical framework that explains the inherent sparsity of standard attention and proposes adaptive strategies for improved sparse attention methods.
Findings
Attention is $n^{C}$-sparse, with only the largest $\Omega(n^{C})$ entries needed.
Stable $o(\log(n))$-sparse attention cannot fully approximate attention due to persistent error.
Adaptive window size strategies outperform fixed ones in accuracy and efficiency for flexible context lengths.
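The findings above can be illustrated with a minimal sketch that compares exact softmax weights for one query row against a top-$k$ version that keeps only the largest scores. This is not the paper's method or experiment: the helper names (`softmax`, `topk_softmax`, `l1_error`), the exponent `C = 0.5`, and the Gaussian scores are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; other weights are zero."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


def l1_error(scores, k):
    """L1 distance between exact and top-k attention weights for one row."""
    dense = softmax(scores)
    sparse = topk_softmax(scores, k)
    return sum(abs(d - s) for d, s in zip(dense, sparse))


random.seed(0)
n = 1024   # context length (assumed)
C = 0.5    # hypothetical exponent, not a value from the paper
scores = [random.gauss(0.0, 1.0) for _ in range(n)]  # synthetic scores

k_poly = max(1, round(n ** C))      # polynomial budget, n^C
k_log = max(1, round(math.log(n)))  # logarithmic budget

err_poly = l1_error(scores, k_poly)
err_log = l1_error(scores, k_log)
print(f"k = n^C = {k_poly}: L1 error {err_poly:.4f}")
print(f"k = log n = {k_log}: L1 error {err_log:.4f}")
```

Because a larger budget keeps a superset of the entries kept by a smaller one, the error in this sketch never increases as $k$ grows. How quickly it shrinks depends on the score distribution, which is exactly what the paper's analysis characterizes.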
Abstract
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during the softmax function computation. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment. Despite its widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to bridge this gap by examining the inherent sparsity of standard attention. Our theoretical framework reveals several brand-new key insights: Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all …
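As a hedged illustration of "ignoring smaller entries during the softmax," the idea can be written out for a single query row. The notation here ($s_j$, $S_k$, $p_j$, $\tilde{p}_j$) is assumed for this sketch and is not taken from the paper.

$$
p_j = \frac{e^{s_j}}{\sum_{l=1}^{n} e^{s_l}}, \qquad
\tilde{p}_j =
\begin{cases}
\dfrac{e^{s_j}}{\sum_{l \in S_k} e^{s_l}} & j \in S_k \\
0 & j \notin S_k
\end{cases}
$$

Here $s_1, \dots, s_n$ are the attention scores of the row and $S_k$ indexes the $k$ largest of them. With this renormalization, $\sum_j |p_j - \tilde{p}_j| = 2 \sum_{j \notin S_k} p_j$, so the $\ell_1$ error of the row is exactly twice the softmax mass that was discarded.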
Taxonomy
Topics: Neural Networks and Applications · Computability, Logic, AI Algorithms
Methods: Sparse Evolutionary Training
