SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang

TL;DR
SeerAttention introduces a learnable gating mechanism for dynamic, block-level sparse attention in LLMs, significantly improving efficiency and accuracy for long-context processing with minimal training overhead.
Contribution
It proposes a novel attention mechanism that learns sparsity patterns directly, outperforming existing heuristic-based methods in speed and accuracy.
Findings
Achieves faster inference with lower latency.
Improves accuracy on long-context tasks.
Requires only lightweight training of gate parameters.
Abstract
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity hinders efficiency and scalability, especially for long-context processing. A promising approach is to leverage sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics at the attention head level, struggling to adapt dynamically to different contexts efficiently. We propose SeerAttention, a simple yet effective attention mechanism that directly learns the block-level attention sparsity from the LLM itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate that selectively activates important blocks within the attention map. Specifically, the gate first pools the query (Q) and key (K) tensors along the sequence dimension and processes them…
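The gating idea in the abstract can be illustrated with a minimal sketch. The abstract is cut off after the pooling step, so everything past pooling is an assumption made for illustration: average pooling as the pooling operator, a scaled dot product between pooled blocks as the gate score, and a fixed Top-k selection per query block. The learnable parameters of the gate and causal masking are omitted, and the names `avg_pool_blocks`, `block_gate`, `block_size`, and `top_k` are hypothetical, not taken from the SeerAttention code.

```python
import math
import random


def avg_pool_blocks(x, block_size):
    """Average consecutive rows of x (seq_len x dim) into one vector per block."""
    pooled = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(row[d] for row in block) / len(block) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_gate(q, k, block_size, top_k):
    """For each query block, return the sorted indices of the key blocks to keep.

    Assumption: the gate score is a scaled dot product of pooled Q and K blocks.
    A trained gate would apply learned parameters before this step.
    """
    q_pooled = avg_pool_blocks(q, block_size)
    k_pooled = avg_pool_blocks(k, block_size)
    scale = 1.0 / math.sqrt(len(q[0]))
    selected = []
    for q_vec in q_pooled:
        # Softmax does not change the ranking; it only normalizes the scores.
        scores = softmax(
            [scale * sum(a * b for a, b in zip(q_vec, k_vec)) for k_vec in k_pooled]
        )
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:top_k]))
    return selected


random.seed(0)
seq_len, dim, block_size, top_k = 16, 8, 4, 2
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

mask = block_gate(q, k, block_size, top_k)
num_key_blocks = seq_len // block_size
sparsity = 1 - top_k / num_key_blocks  # fraction of key blocks skipped per query block

print(mask)
print(f"block sparsity: {sparsity:.2f}")
```

In this sketch only the selected key blocks would be passed to a block-sparse attention kernel. With `top_k` fixed, the fraction of skipped blocks is `1 - top_k / num_key_blocks`, so the sparsity grows as the context gets longer.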
Peer Reviews
Decision: Submitted to ICLR 2025
* SeerAttention operates at the block level, leading to high efficiency potential.
* SeerAttention learns the sparsity pattern during fine-tuning, which is more flexible than heuristic sparsity.
* An efficient SeerAttention kernel is provided and leads to a 5+x speedup over FlashAttention.
* I understand that SeerAttention allows users to adjust the balance between sparsity and accuracy. However, how to determine the sparsity (or Top-k) in practice is not clear to me. I suggest the authors include a discussion on how to choose sparsity in order to maintain high accuracy.
- The proposed SeerAttention mechanism learns attention sparsity instead of relying on predefined patterns. This allows for better adaptation to different language tasks and models, as demonstrated by its performance across various experiments.
- The development of a customized FlashAttention implementation enables efficient learning of the gating network by extracting the block-level ground truth of the attention map with minimum overhead. This not only improves the training process but also co
- In Section 3.1, the description is insufficient. It is hard to understand the proposed method from the description in Section 3.1 alone, whereas Figure 2 is easy to understand. I suggest adding more details around Figure 2 to Section 3.1 to enhance understanding. For instance, the operations and significance of each component in the AttnGate module need to be elaborated.
- In Section 4.2, many symbols related to FlashAttention are used without prior explanation, making it dif
- Addressing the accuracy-efficiency trade-off of LLMs during long-context inference is a critical challenge, especially given the recent trend of employing LLMs to tackle increasingly complex problems with sophisticated inference processes.
- The proposed approach, which learns a sparse attention distribution rather than relying on pre-defined attention patterns or heuristics to approximate sparsity, is intuitively effective in achieving a superior accuracy-efficiency trade-off.
- The customi
- **Limitations in Related Work Discussion**: Alleviating attention sparsity to enable long-context inference through attention optimization has been a significant area of research, with numerous discussions surrounding it. To provide a more comprehensive background, it would be beneficial for the authors to include additional related works on attention sparsity, particularly approaches that leverage KV cache eviction, encompassing both pre-defined patterns [1] and dynamic pattern attention spar
Code & Models
- 🤗 SeerAttention/SeerAttention-Llama-3.1-8B · model · 6 dl · ♡ 4
- 🤗 SeerAttention/SeerAttention-Llama-3.1-8B-AttnGates · model · 4.6k dl · ♡ 4
- 🤗 SeerAttention/SeerAttention-Qwen2.5-7B-AttnGates · model · 16 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates · model · 3 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-Qwen2.5-32B-AttnGates · model · 6 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-Llama-3.1-70B-AttnGates · model · 4 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-32B-AttnGates · model · 4 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates · model · 4 dl · ♡ 1
- 🤗 SeerAttention/SeerAttention-QwQ-32B-AttnGates · model · 10 dl · ♡ 4
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification
Methods: Softmax · Attention Is All You Need
