SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Jintao Zhang; Jia Wei; Haofeng Huang; Pengle Zhang; Jun Zhu; Jianfei Chen

arXiv:2410.02367·cs.LG·October 2, 2025·2 cites

TL;DR

SageAttention introduces a highly efficient 8-bit quantization method for transformer attention, significantly accelerating inference with minimal accuracy loss across various large-scale models.

Contribution

The paper proposes SageAttention, a novel quantization technique specifically designed for attention mechanisms, outperforming existing methods in speed and accuracy.

Findings

01 · OPS (operations per second) outperforms FlashAttention2 and xformers by about 2.1x and 2.7x, respectively

02 · Achieves higher accuracy than FlashAttention3

03 · Almost no loss in end-to-end metrics across language, image, and video generation models

Abstract

The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of $O(N^{2})$, compared to $O(N)$ for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy over FlashAttention3. Comprehensive…
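
To make the idea of quantizing attention concrete, here is a minimal sketch in plain Python. It is not the paper's kernel. It assumes symmetric per-row INT8 quantization of Q and K only, keeps the softmax and the product with V in floating point, and compares the result with full-precision attention on random data. The function names (quantize_rows_int8, attention, max_abs_error) and the sizes N and D are illustrative assumptions; the quantization granularity used by SageAttention itself may differ.

```python
import math
import random


def quantize_rows_int8(mat):
    """Symmetric per-row INT8 quantization: returns (integer rows, per-row scales)."""
    q_rows, scales = [], []
    for row in mat:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the signed 8-bit range.
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, quantized=False):
    """Single-head attention; optionally computes Q K^T from INT8 values."""
    d = len(q[0])
    if quantized:
        q_int, q_scale = quantize_rows_int8(q)
        k_int, k_scale = quantize_rows_int8(k)
    out = []
    for i in range(len(q)):
        scores = []
        for j in range(len(k)):
            if quantized:
                # Integer dot product, then dequantize with the two row scales.
                acc = sum(a * b for a, b in zip(q_int[i], k_int[j]))
                s = acc * q_scale[i] * k_scale[j]
            else:
                s = sum(a * b for a, b in zip(q[i], k[j]))
            scores.append(s / math.sqrt(d))
        p = softmax(scores)  # N scores per query row, hence O(N^2) overall
        out.append([sum(p[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
N, D = 16, 8  # hypothetical sequence length and head dimension
q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
v = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

ref = attention(q, k, v, quantized=False)
approx = attention(q, k, v, quantized=True)
err = max_abs_error(ref, approx)
print(f"max abs error of INT8 QK^T attention vs full precision: {err:.5f}")
```

The integer dot product is the step an INT8 kernel would run on low-precision hardware units; the scales are applied once per score afterward. The printed error depends on the seed and on the data distribution, so it should be read as an illustration rather than a measurement of SageAttention's accuracy.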

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The contributions of this work are well motivated.
2. The proposed method seems to be novel although I am not an expert in this field.
3. The experiments are quite extensive, covering two different GPUs (RTX4090 and RTX3090), representative models for language, image, and video generation, and a wide range of datasets.
4. The results are quite impressive, showing more than two times speedup without performance degradation.

Weaknesses

1. Some design choices seem to be dictated by the specific hardware evaluated, the RTX4090 and RTX3090 (L271). Are those design choices also compatible with other GPUs like A100 and H100?
2. Table 7 shows that different models and tasks see different speedups. How is the speedup related to the specific transformer architecture, model size, and complexity of the task?

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Paper is well written.
- Experiments are thorough.
- Problem is challenging.

Weaknesses

- A full comparison to strong SOTA methods such as FlashAttention3, though briefly mentioned in the introduction and in Table 14, is not deeply explored.
- Only 4090/3000 series GPUs are targeted; the method should also be tested on stronger, server-level GPUs, where the limitations are most severe.
- It would be great to test across VLMs too.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

+ The paper is well-written, with a logical structure and organization that facilitates understanding.
+ SageAttention shows competitive performance, outperforming FlashAttention2 and xformers by approximately 2.1x and 2.7x, respectively.
+ The method exhibits almost no end-to-end metrics loss across a variety of models, including large language models (LLMs), text-to-image (T2I), and text-to-video (T2V).
+ The discovery of channel-wise consistency, as illustrated in Figure 4, is particularly

Weaknesses

- The method relies heavily on FlashAttention, which may weaken its technical contribution and originality. How would it perform without FlashAttention as its basis?
- The reported superiority over FlashAttention3 appears to be quite marginal, raising questions about the significance of the improvements.
- Another major weakness of this paper is that it does not compare SageAttention with other task-specific quantization methods, such as AWQ [1] for LLMs, Q-diffusion

Code & Models

Repositories

thu-ml/SageAttention
pytorch · Official

Models

🤗
jt-zhang/SageAttention2_plus
model · ♡ 26

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Seismology and Earthquake Studies

Methods: Softmax · Attention Is All You Need · Focus