SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen

TL;DR
SageAttention2 is an attention mechanism that combines 4-bit quantization with precision-enhancing techniques, surpassing FlashAttention2 and xformers by about 3x and 4.5x in operations per second on RTX4090 while maintaining accuracy across language, image, and video models.
Contribution
It proposes thread-level INT4 quantization, a smoothing method, and a two-level accumulation strategy for FP8, which together speed up attention computation with minimal accuracy loss.
Findings
Operations per second (OPS) surpass FlashAttention2 and xformers by about 3x and 4.5x, respectively, on RTX4090
Matches FlashAttention3 speed on Hopper GPUs with higher accuracy
Negligible loss in end-to-end metrics across language, image, and video models
Abstract
Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize matrices (Q, K) to INT4 in a hardware-friendly thread-level granularity and quantize matrices (P̃, V) to FP8. Second, we propose a method to smooth Q, enhancing the accuracy of INT4 QK^⊤. Third, we propose a two-level accumulation strategy for P̃V to enhance the accuracy of FP8 P̃V. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 4.5x on RTX4090, respectively.…
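The smoothing idea in the abstract can be illustrated with a small NumPy sketch. It rests on several assumptions that the text above does not spell out: that smoothing means subtracting the per-channel mean of the query matrix over tokens before INT4 quantization, that the mean's contribution is added back in full precision, and that quantization is symmetric with one scale per row. The helper names `quantize_int4` and `qk_int4` are hypothetical, and neither the paper's thread-level granularity nor its GPU kernel is reproduced here.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric quantization to [-7, 7] with one scale per row (an assumption)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def qk_int4(Q, K, smooth):
    """Approximate Q @ K.T using INT4-quantized operands.

    With smooth=True, the token-wise mean of Q is removed before quantization
    and its product with K is added back in floating point.
    """
    if smooth:
        q_mean = Q.mean(axis=0, keepdims=True)   # shape (1, d)
        Q_centered = Q - q_mean
        correction = q_mean @ K.T                # shape (1, Nk), kept in full precision
    else:
        Q_centered, correction = Q, 0.0
    q_int, q_scale = quantize_int4(Q_centered)
    k_int, k_scale = quantize_int4(K)
    # Integer matmul with a wide accumulator, then dequantize with the row scales.
    S_int = q_int.astype(np.int32) @ k_int.astype(np.int32).T
    return S_int * q_scale * k_scale.T + correction

# Synthetic data: queries share a large offset across tokens (a channel-wise outlier).
rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 32)) + 8.0
K = rng.normal(size=(48, 32))
exact = Q @ K.T
err_plain = np.abs(qk_int4(Q, K, smooth=False) - exact).mean()
err_smooth = np.abs(qk_int4(Q, K, smooth=True) - exact).mean()
print(f"mean abs error without smoothing: {err_plain:.3f}")
print(f"mean abs error with smoothing:    {err_smooth:.3f}")
```

On this synthetic input the shared offset dominates each row's dynamic range, so removing it before quantization leaves more of the 4-bit range for the token-specific variation. Real attention inputs may behave differently.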
