SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

Jintao Zhang; Haofeng Huang; Pengle Zhang; Jia Wei; Jun Zhu; Jianfei Chen

arXiv:2411.10958 · cs.LG · October 2, 2025

TL;DR

SageAttention2 accelerates attention by quantizing $(Q, K)$ to INT4 and $(P, V)$ to FP8, adding precision-enhancing techniques to maintain accuracy. Its operations per second (OPS) surpass FlashAttention2 and xformers by about 3x and 4.5x, respectively, on RTX4090.

Contribution

The paper proposes thread-level INT4 quantization of $(Q, K)$, FP8 quantization of $(P, V)$, a smoothing method for $Q$, and a two-level accumulation strategy for $P V$, speeding up attention computation with minimal accuracy loss.

Findings

01

OPS surpass FlashAttention2 and xformers by about 3x and 4.5x, respectively, on RTX4090

02

Matches FlashAttention3 speed on Hopper GPUs with higher accuracy

03

Negligible end-to-end metrics loss across language, image, and video models

Abstract

Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 in a hardware-friendly thread-level granularity and quantize matrices $(P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of INT4 $Q K^{\top}$. Third, we propose a two-level accumulation strategy for $P V$ to enhance the accuracy of FP8 $P V$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 4.5x on RTX4090, respectively.…
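
The abstract does not spell out how $Q$ is smoothed or how the thread-level scales are laid out. As a minimal NumPy sketch under stated assumptions, the code below subtracts the token-wise mean of $Q$ before INT4 quantization and adds the exact term $\bar{q} K^{\top}$ back afterwards, using the identity $Q K^{\top} = (Q - \bar{q}) K^{\top} + \bar{q} K^{\top}$. One scale per row stands in for the paper's per-thread granularity. The helper names (`quantize_int4`, `int4_scores`, `smoothed_int4_scores`) and the synthetic outlier channel are hypothetical and are not taken from the SageAttention2 code.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row quantization to the INT4 range [-7, 7].

    Assumption: one scale per row, a simplification of the
    thread-level granularity named in the abstract.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def int4_scores(Q, K):
    """Approximate Q @ K.T from INT4-quantized operands."""
    q_int, q_scale = quantize_int4(Q)
    k_int, k_scale = quantize_int4(K)
    # Integer matmul with a wide accumulator, then dequantize.
    acc = q_int.astype(np.int32) @ k_int.astype(np.int32).T
    return acc * q_scale * k_scale.T

def smoothed_int4_scores(Q, K):
    """Assumed smoothing: remove the token-wise mean of Q before
    quantizing, then add the full-precision term q_mean @ K.T back."""
    q_mean = Q.mean(axis=0, keepdims=True)   # shape (1, d)
    correction = q_mean @ K.T                # shape (1, n_k), not quantized
    return int4_scores(Q - q_mean, K) + correction

# Synthetic data: one channel of Q carries a large bias shared by all tokens.
rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 32))
K = rng.standard_normal((64, 32))
Q[:, 3] += 20.0

exact = Q @ K.T
err_plain = np.abs(int4_scores(Q, K) - exact).mean()
err_smooth = np.abs(smoothed_int4_scores(Q, K) - exact).mean()
print(f"mean abs error, plain INT4:    {err_plain:.3f}")
print(f"mean abs error, smoothed INT4: {err_smooth:.3f}")
```

In this toy setup the shared bias dominates each row's scale, so the remaining entries of $Q$ collapse onto very few INT4 levels. Removing the mean first lets the 4-bit range cover the informative variation, which is the intuition behind smoothing. The sketch makes no claim about the paper's actual kernel layout or block sizes.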

Code & Models

Models

jt-zhang/SageAttention2_plus

Taxonomy

Topics: Anomaly Detection Techniques and Applications · COVID-19 diagnosis using AI · Seismology and Earthquake Studies

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings