SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Jintao Zhang; Jia Wei; Pengle Zhang; Xiaoming Xu; Haofeng Huang; Haoxu Wang; Kai Jiang; Jianfei Chen; Jun Zhu

arXiv:2505.11594·cs.LG·January 16, 2026

TL;DR

This paper introduces SageAttention3, an FP4 attention implementation that uses the FP4 Tensor Cores in Blackwell GPUs to accelerate inference, and explores 8-bit attention for training large models, which is lossless in fine-tuning but converges more slowly in pretraining.

Contribution

The paper presents an FP4 attention implementation accelerated by FP4 Tensor Cores and pioneers 8-bit attention for training, extending low-bit attention from inference to both the forward and backward passes of training.

Findings

01. FP4 attention reaches 1038 TOPS on RTX5090, a 5x speedup over the fastest FlashAttention on the same GPU.

02. 8-bit attention is lossless in fine-tuning tasks.

03. 8-bit attention converges more slowly in pretraining tasks.

Abstract

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that…
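
The abstract does not describe the quantization scheme itself, so the following is only a rough sketch of what "microscaling FP4" in the title generally refers to: values are split into small blocks, each block shares one scale factor, and the scaled values are rounded to the 4-bit E2M1 grid. It is not the authors' kernel. The block size of 16, the full-precision Python float used for the scale, the tie-breaking rule, and all function names are assumptions made for illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0


def round_to_fp4(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here; real hardware rounding may differ.
    """
    mag = min(abs(x), FP4_MAX)
    nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
    return math.copysign(nearest, x) if nearest != 0.0 else 0.0


def quantize_block(block):
    """Quantize one block with a single shared scale.

    Assumption: the scale stays a full-precision float; a hardware format
    would store it in a low-bit type as well.
    """
    amax = max(abs(v) for v in block)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = [round_to_fp4(v / scale) for v in block]
    return codes, scale


def microscale_quantize(row, block_size=16):
    """Split a row into blocks (block_size is an assumed value) and quantize each."""
    return [quantize_block(row[i:i + block_size])
            for i in range(0, len(row), block_size)]


def dequantize(blocks):
    """Reconstruct approximate values from (codes, scale) pairs."""
    return [c * s for codes, s in blocks for c in codes]


def quantized_dot(q_row, k_row, block_size=16):
    """Approximate q . k from FP4-rounded values and per-block scales."""
    total = 0.0
    q_blocks = microscale_quantize(q_row, block_size)
    k_blocks = microscale_quantize(k_row, block_size)
    for (qc, qs), (kc, ks) in zip(q_blocks, k_blocks):
        # The low-bit products are accumulated first, then rescaled once per block.
        total += qs * ks * sum(a * b for a, b in zip(qc, kc))
    return total
```

Because each block carries its own scale, a single outlier only degrades the precision of the block it sits in rather than the whole row, which is the usual motivation for block-wise scaling at very low bit widths.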

Code & Models

Repositories

thu-ml/SageAttention (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Big Data and Digital Economy

Methods: Softmax · Attention Is All You Need · Focus