SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu

TL;DR
This paper introduces SageAttention3, an FP4 attention implementation that uses the FP4 Tensor Cores in Blackwell GPUs, and explores 8-bit attention for training large models, which performs well in fine-tuning but converges more slowly in pretraining.
Contribution
It presents a novel FP4 attention acceleration using FP4 Tensor Cores and pioneers 8-bit attention for training, extending low-bit efficiency from inference to training tasks.
Findings
FP4 attention achieves 1038 TOPS on RTX5090, a 5x speedup over the fastest FlashAttention on the same GPU.
8-bit attention is lossless in fine-tuning tasks.
8-bit attention shows slower convergence in pretraining tasks.
Abstract
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that…
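The idea of microscaling FP4 quantization named in the title can be sketched as follows. This is a minimal NumPy illustration, not the authors' kernel: the block size of 16, the float32 per-block scale, and the helper names `quantize_fp4_blockwise` and `dequantize` are assumptions made for the example. Only the set of magnitudes representable in FP4 (E2M1) is taken as given. Values are "fake-quantized", meaning they are rounded to the FP4 grid but stored as float32.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def quantize_fp4_blockwise(x, block=16):
    """Round x to the FP4 grid with one scale per `block` values of the last axis.

    Assumption: block=16 and a float32 scale are illustrative choices only.
    Returns (q, scale): q has the shape of x, scale has one entry per block.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.shape[-1] % block == 0, "last axis must be divisible by block"
    blocks = x.reshape(-1, block)

    # One scale per block, chosen so the largest magnitude maps to 6.0.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0).astype(np.float32)
    scaled = blocks / scale

    # Round every magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = (np.sign(scaled) * FP4_GRID[idx]).astype(np.float32)

    n_blocks = x.shape[-1] // block
    return q.reshape(x.shape), scale.reshape(x.shape[:-1] + (n_blocks,))


def dequantize(q, scale, block=16):
    """Undo the per-block scaling applied by quantize_fp4_blockwise."""
    return (q.reshape(-1, block) * scale.reshape(-1, 1)).reshape(q.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((8, 64)).astype(np.float32)
    K = rng.standard_normal((8, 64)).astype(np.float32)

    Qh = dequantize(*quantize_fp4_blockwise(Q))
    Kh = dequantize(*quantize_fp4_blockwise(K))

    # Compare the attention scores Q K^T before and after quantization.
    err = np.abs(Q @ K.T - Qh @ Kh.T).mean()
    print("mean abs error in QK^T:", err)
```

Because each block carries its own scale, a single outlier only reduces the resolution of the 16 values it shares a block with, rather than of the whole tensor. A real kernel would additionally pack two FP4 values per byte and run the matrix multiply on the FP4 Tensor Cores; neither is modeled here.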
Code & Models
- 🤗 jt-zhang/SageAttention2_plus (model) · ♡ 26
- 🤗 jt-zhang/SageAttention3 (model) · ♡ 54
- 🤗 TurboDiffusion/TurboWan2.1-T2V-1.3B-480P (model) · ♡ 26
- 🤗 TurboDiffusion/TurboWan2.2-I2V-A14B-720P (model) · ♡ 157
- 🤗 TurboDiffusion/TurboWan2.1-T2V-14B-720P (model) · ♡ 8
- 🤗 TurboDiffusion/TurboWan2.1-T2V-14B-480P (model) · ♡ 10
- 🤗 vantagewithai/TurboWan2.2-I2V-A14B-720P-ComfyUI (model) · ♡ 1
- 🤗 vantagewithai/TurboWan2.2-I2V-A14B-720P-ComfyUI-GGUF (model) · 254 dl · ♡ 1
- 🤗 vantagewithai/TurboWan2.1-T2V-14B-720P-ComfyUI (model)
- 🤗 vantagewithai/TurboWan2.1-T2V-14B-720P-ComfyUI-GGUF (model) · 85 dl · ♡ 2
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Big Data and Digital Economy
Methods: Softmax · Attention Is All You Need · Focus
