How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

Yichuan Deng; Zhao Song; Jing Xiong; Chiwun Yang

arXiv:2404.02690 · cs.LG · February 13, 2025 · 1 citation

TL;DR

This paper provides a theoretical analysis of sparse attention, showing that standard attention is inherently $n^{C}$-sparse and establishing conditions under which sparse attention approximates exact attention, which can guide more efficient model design.

Contribution

The paper introduces a theoretical framework that explains the inherent sparsity of standard attention and proposes adaptive strategies for improving sparse attention methods.

Findings

01

Attention is $n^{C}$-sparse, with only the largest $\Omega(n^{C})$ entries needed.

02

Stable $o(\log(n))$-sparse attention cannot fully approximate attention due to persistent error.

03

Adaptive window size strategies outperform fixed ones in accuracy and efficiency for flexible context lengths.
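
To make the first finding concrete, here is a minimal sketch of top-k sparse attention for a single query, compared against exact softmax attention. It is an illustration only, not the paper's method or experimental setup: the function names (`exact_attention`, `topk_sparse_attention`, `max_abs_error`), the Gaussian random inputs, and the sizes `n = 256`, `d = 16` are all assumptions chosen for the example. The sparse variant keeps the `k` largest attention scores and renormalizes the softmax over them, so with `k = n` it reduces to exact attention.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_scores(q, keys):
    """Dot-product scores q . k_i / sqrt(d) for one query against all keys."""
    d = len(q)
    return [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in keys]


def exact_attention(q, keys, values):
    """Exact attention output for one query: softmax-weighted sum of all values."""
    weights = softmax(scaled_scores(q, keys))
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]


def topk_sparse_attention(q, keys, values, k):
    """Keep only the k largest scores and renormalize the softmax over them."""
    scores = scaled_scores(q, keys)
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in kept])
    dim_v = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, kept)) for j in range(dim_v)]


def max_abs_error(a, b):
    """Largest coordinate-wise gap between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy inputs: Gaussian query, keys, and values (an assumption,
# not the input distribution analyzed in the paper).
rng = random.Random(0)
n, d = 256, 16
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

exact = exact_attention(q, keys, values)
errors = {
    k: max_abs_error(exact, topk_sparse_attention(q, keys, values, k))
    for k in (4, 16, 64, 256)
}
for k, err in errors.items():
    print(f"k={k:4d}  max abs error={err:.6f}")
```

Running the script prints the approximation error for each budget `k`. How quickly the error shrinks depends on how concentrated the scores are, which is the quantity the paper's $n^{C}$-sparsity result characterizes; this toy run does not reproduce any figure from the paper.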

Abstract

Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during the softmax function computation. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment. Despite its widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to bridge this gap by examining the inherent sparsity of standard attention processes. Our theoretical framework reveals several brand-new key insights: • Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all $n$ …

Taxonomy

Topics: Neural Networks and Applications · Computability, Logic, AI Algorithms

Methods: Sparse Evolutionary Training