SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Yizhao Gao; Shuming Guo; Shijie Cao; Yuqing Xia; Yu Cheng; Lei Wang; Lingxiao Ma; Yutao Sun; Tianzhu Ye; Li Dong; Hayden Kwok-Hay So; Yu Hua; Ting Cao; Fan Yang; Mao Yang

arXiv:2506.08889·cs.LG·June 11, 2025

TL;DR

SeerAttention-R is a sparse attention framework for the long decoding phase of reasoning models. It enables efficient auto-regressive decoding with near-lossless accuracy, and its sparse decoding kernel runs up to 9x faster than FlashAttention-3 on an H100 GPU.

Contribution

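The paper introduces SeerAttention-R, a plug-in gating mechanism that learns attention sparsity through self-distillation. It can be added to an existing pretrained model without modifying the original parameters, preserves reasoning accuracy after training on only 0.4B tokens, and speeds up decoding through an optimized sparse kernel.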

Findings

01

Maintains near-lossless reasoning accuracy on the AIME benchmark with a 4K token budget at sparse attention block sizes of 64 and 128.

02

Achieves up to 9x speedup over FlashAttention-3 on an H100 GPU at 90% sparsity.

03

Trained on only 0.4B tokens, demonstrating data efficiency.

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating mechanism, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at:…
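To make the ideas of a token budget, block-level sparsity, and the speedup bound concrete, here is a minimal pure-Python sketch. It is not the paper's implementation or its TileLang kernel. The function names, the `gate_scores` input (one relevance score per key/value block, standing in for whatever the learned gate produces), and the top-k selection rule are assumptions made for illustration.

```python
import math


def select_blocks(gate_scores, block_size, token_budget):
    """Pick the highest-scoring KV blocks that fit in the token budget.

    gate_scores is a hypothetical list with one relevance score per KV block,
    standing in for the output of a learned gate (an assumption of this sketch).
    """
    num_keep = max(1, token_budget // block_size)
    num_keep = min(num_keep, len(gate_scores))
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def sparse_decode_attention(query, keys, values, kept_blocks, block_size):
    """Softmax attention for one decoding query over the kept blocks only."""
    # Expand block indices into the token positions they cover.
    token_ids = [t for b in kept_blocks
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[t]))
              for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(dim)]


def ideal_speedup(sparsity):
    """Upper bound on speedup if cost scales with the fraction of blocks read."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)
```

Under these assumptions, a 4K token budget with block size 64 keeps `4096 // 64 = 64` blocks per query, and `ideal_speedup(0.9)` is about 10, which is consistent with the abstract describing 9x at 90% sparsity as near-theoretical.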

Code & Models

Repositories

microsoft/seerattention (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need