SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Yizhao Gao; Zhichen Zeng; Dayou Du; Shijie Cao; Peiyuan Zhou; Jiaxing Qi; Junjie Lai; Hayden Kwok-Hay So; Ting Cao; Fan Yang; Mao Yang

arXiv:2410.13276·cs.CL·February 18, 2025·2 cites

TL;DR

SeerAttention introduces a learnable gating mechanism for dynamic, block-level sparse attention in LLMs, significantly improving efficiency and accuracy for long-context processing with minimal training overhead.

Contribution

It proposes a novel attention mechanism that learns sparsity patterns directly, outperforming existing heuristic-based methods in speed and accuracy.

Findings

01. Achieves faster inference with lower latency.

02. Improves accuracy on long-context tasks.

03. Requires only lightweight training of gate parameters.

Abstract

Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity hinders efficiency and scalability, especially for long-context processing. A promising approach is to leverage sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics at the attention head level, struggling to adapt dynamically to different contexts efficiently. We propose SeerAttention, a simple yet effective attention mechanism that directly learns the block-level attention sparsity from the LLM itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate that selectively activates important blocks within the attention map. Specifically, the gate first pools the query (Q) and key (K) tensors along the sequence dimension and processes them…
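
The gating idea described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the mean pooling, the projection matrices `w_q` and `w_k`, the causal block mask, and the Top-k selection rule are all assumptions made for the example, since the abstract is cut off before it specifies these steps.

```python
import numpy as np

def block_pool(x, block_size):
    """Mean-pool a (seq_len, dim) array into (num_blocks, dim).
    Mean pooling is an assumption chosen for this sketch."""
    seq_len, dim = x.shape
    assert seq_len % block_size == 0, "seq_len must be a multiple of block_size"
    return x.reshape(seq_len // block_size, block_size, dim).mean(axis=1)

def attn_gate_mask(q, k, w_q, w_k, block_size, top_k):
    """Return a boolean (num_blocks, num_blocks) mask of blocks to compute.
    w_q and w_k stand in for learnable gate parameters (hypothetical names)."""
    q_blocks = block_pool(q, block_size) @ w_q   # (num_blocks, d_gate)
    k_blocks = block_pool(k, block_size) @ w_k   # (num_blocks, d_gate)
    scores = q_blocks @ k_blocks.T / np.sqrt(w_q.shape[1])

    # Assumed causal setting: a query block only sees earlier or equal key blocks.
    num_blocks = scores.shape[0]
    causal = np.tril(np.ones((num_blocks, num_blocks), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Keep the top_k highest-scoring key blocks for every query block.
    order = np.argsort(-scores, axis=1)[:, :top_k]
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask & causal  # drop any non-causal picks in short rows

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))    # (seq_len, head_dim)
k = rng.standard_normal((16, 8))
w_q = rng.standard_normal((8, 4))   # random stand-ins for trained gate weights
w_k = rng.standard_normal((8, 4))
mask = attn_gate_mask(q, k, w_q, w_k, block_size=4, top_k=2)
print(mask.astype(int))
```

A block-sparse attention kernel would then skip every block whose mask entry is false. In this sketch the gate weights are random, so the selected blocks are arbitrary; only the shape of the computation is meant to be informative.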

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* SeerAttention operates at the block level, leading to high efficiency potential.
* SeerAttention learns the sparsity pattern during fine-tuning, which is more flexible than heuristic sparsity.
* An efficient SeerAttention kernel is provided and leads to a 5+x speedup over FlashAttention.

Weaknesses

* I understand that SeerAttention allows users to adjust the balance between sparsity and accuracy. However, how to determine the sparsity (or Top-k) in practice is not clear to me. I suggest the authors include a discussion on how to choose sparsity in order to maintain high accuracy.
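
For orientation, the arithmetic linking a Top-k budget to the resulting block sparsity can be sketched as below. The example assumes a causal block mask in which each query block keeps at most k key blocks; the function name and that assumption are illustrative and not taken from the paper, and the sketch says nothing about which sparsity level preserves accuracy.

```python
def causal_block_sparsity(num_blocks: int, top_k: int) -> float:
    """Fraction of causal attention blocks skipped when each query block
    keeps at most top_k key blocks (hypothetical helper for illustration)."""
    # Query block i can see i + 1 key blocks, so it keeps min(top_k, i + 1).
    kept = sum(min(top_k, i + 1) for i in range(num_blocks))
    total = num_blocks * (num_blocks + 1) // 2  # all blocks in the causal triangle
    return 1.0 - kept / total

for k in (1, 2, 4, 8):
    print(k, round(causal_block_sparsity(8, k), 3))
```

With a fixed k, sparsity grows with sequence length because the number of causal blocks grows quadratically while the kept blocks grow linearly.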

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The proposed SeerAttention mechanism learns attention sparsity instead of relying on predefined patterns. This allows for better adaptation to different language tasks and models, as demonstrated by its performance across various experiments.
- The development of a customized FlashAttention implementation enables efficient learning of the gating network by extracting the block-level ground truth of the attention map with minimum overhead. This not only improves the training process but also co

Weaknesses

- In Section 3.1, the description is insufficient. It is hard to understand the proposed method from the description in Section 3.1 alone, although Figure 2 is easy to understand. I suggest adding more details around Figure 2 to Section 3.1 to enhance understanding. For instance, the operations and significance of each component in the AttnGate module need to be elaborated.
- In Section 4.2, many symbols related to FlashAttention are used without prior explanation, making it dif

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- Addressing the accuracy-efficiency trade-off of LLMs during long-context inference is a critical challenge, especially given the recent trend of employing LLMs to tackle increasingly complex problems with sophisticated inference processes.
- The proposed approach, which learns a sparse attention distribution rather than relying on pre-defined attention patterns or heuristics to approximate sparsity, is intuitively effective in achieving a superior accuracy-efficiency trade-off.
- The customi

Weaknesses

- **Limitations in Related Work Discussion**: Alleviating attention sparsity to enable long-context inference through attention optimization has been a significant area of research, with numerous discussions surrounding it. To provide a more comprehensive background, it would be beneficial for the authors to include additional related works on attention sparsity, particularly approaches that leverage KV cache eviction, encompassing both pre-defined patterns [1] and dynamic pattern attention spar

Code & Models

Repositories

microsoft/seerattention
pytorch · Official

Models

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification

Methods: Softmax · Attention Is All You Need