SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference

Jintao Zhang; Chendong Xiang; Haofeng Huang; Jia Wei; Haocheng Xi; Jun Zhu; Jianfei Chen

arXiv:2502.18137 · cs.LG · November 20, 2025

TL;DR

SpargeAttention introduces a universal, training-free sparse attention method that accelerates various models by accurately predicting and skipping unnecessary computations without compromising performance.

Contribution

The paper presents SpargeAttn, a universal sparse and quantized attention mechanism that uses a two-stage online filter to speed up model inference across language, image, and video models.

Findings

01 Significantly accelerates language, image, and video models

02 Maintains end-to-end performance metrics

03 Operates without additional training or overhead

Abstract

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second…
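The first stage described in the abstract (predict which parts of the attention map matter, then skip the rest) can be illustrated with a small block-sparse sketch. This is not the authors' implementation: the abstract does not say how the prediction is made, so the mean-pooling predictor, the `keep_mass` threshold, and every function name below are assumptions chosen for illustration. The second stage and the quantization are not covered.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(rows):
    # Average a block of vectors into a single representative vector.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def predict_block_mask(Q, K, block, keep_mass=0.95):
    # Hypothetical predictor (an assumption, not taken from the paper):
    # score mean-pooled query blocks against mean-pooled key blocks and,
    # per query block, keep the fewest key blocks whose coarse softmax
    # probabilities sum to at least keep_mass. At least one block is kept.
    d = len(Q[0])
    q_blocks = [mean_pool(Q[i:i + block]) for i in range(0, len(Q), block)]
    k_blocks = [mean_pool(K[j:j + block]) for j in range(0, len(K), block)]
    mask = []
    for qb in q_blocks:
        probs = softmax([dot(qb, kb) / math.sqrt(d) for kb in k_blocks])
        order = sorted(range(len(probs)), key=lambda j: -probs[j])
        keep, total = set(), 0.0
        for j in order:
            keep.add(j)
            total += probs[j]
            if total >= keep_mass:
                break
        mask.append([j in keep for j in range(len(k_blocks))])
    return mask

def block_sparse_attention(Q, K, V, block, mask):
    # Softmax attention that only touches keys in the kept blocks.
    # Scores for skipped blocks are never computed.
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        kept = [j for j in range(len(K)) if mask[i // block][j // block]]
        probs = softmax([dot(q, K[j]) / math.sqrt(d) for j in kept])
        out.append([sum(p * V[j][c] for p, j in zip(probs, kept))
                    for c in range(len(V[0]))])
    return out

def dense_attention(Q, K, V):
    # Reference: the same routine with every 1x1 block kept.
    return block_sparse_attention(Q, K, V, 1, [[True] * len(K) for _ in Q])
```

When the skipped blocks really do carry near-zero attention weight, the sparse output stays close to the dense one while the number of query-key products shrinks with the fraction of blocks kept. The savings depend entirely on how sparse the attention map is and on how cheap and accurate the predictor is, which is the problem the paper addresses.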

Code & Models

Repositories

thu-ml/spargeattn
PyTorch · Official

Models

🤗 Xiang-cd/sparge-attention-model-zoo
model · ♡ 6

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus