Attn-QAT: 4-Bit Attention With Quantization-Aware Training
Peiyuan Zhang, Matthew Noto, Wenxuan Tan, Chengquan Jiang, Will Lin, Wei Zhou, Hao Zhang

TL;DR
This paper introduces Attn-QAT, a quantization-aware training method for 4-bit (FP4) attention that stabilizes training and preserves model quality, enabling faster inference on FP4-capable GPUs with minimal accuracy loss.
Contribution
It presents the first systematic approach to 4-bit attention QAT, identifying key principles for stability and implementing efficient kernels for training and inference.
Findings
Attn-QAT recovers FP4 attention quality without outlier heuristics.
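Attn-QAT achieves up to a 1.5x speedup on an RTX 5090.
Two principles stabilize 4-bit attention training: matching the low-precision recomputation of attention scores in the backward pass, and resolving implicit precision assumptions in Flash Attention's gradient calculation.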
Abstract
Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA's gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT…
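The first principle in the abstract (the backward pass should recompute attention scores at the same low precision as the forward pass) can be illustrated with a small, pure-Python sketch. This is not the paper's Triton kernel: the helper names `fake_quant_fp4` and `attention_probs`, the one-scale-per-vector scheme, and the toy inputs are all assumptions made for illustration. The only fixed fact used is the set of magnitudes representable in the FP4 E2M1 format.

```python
import math

# Magnitudes representable in FP4 E2M1 (the sign bit is handled separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(vec):
    """Quantize then dequantize a vector, using one shared scale (an assumption)."""
    amax = max(abs(x) for x in vec)
    if amax == 0.0:
        return list(vec)
    scale = amax / FP4_E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in vec:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(q, keys, quantize):
    """Attention probabilities for one query, optionally with FP4 fake quantization."""
    if quantize:
        q = fake_quant_fp4(q)
        keys = [fake_quant_fp4(k) for k in keys]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(scores)


# Toy inputs (hypothetical values chosen only to make the effect visible).
q = [0.9, -0.75, 0.31, 1.2]
keys = [
    [1.1, 0.4, -0.3, 0.2],
    [0.05, -0.9, 0.7, 1.0],
    [-0.7, 0.8, 0.6, -0.5],
]

# Forward pass in simulated FP4.
p_forward = attention_probs(q, keys, quantize=True)

# FA-style backward passes recompute the probabilities instead of storing them.
p_recomputed_matched = attention_probs(q, keys, quantize=True)
p_recomputed_high_precision = attention_probs(q, keys, quantize=False)

mismatch = max(
    abs(a - b) for a, b in zip(p_forward, p_recomputed_high_precision)
)
print("matched recomputation equals forward:", p_recomputed_matched == p_forward)
print("max mismatch with high-precision recomputation:", mismatch)
```

In this sketch, the high-precision recomputation yields probabilities that differ from the ones the forward pass actually used, so any gradient built from them would correspond to a different function than the one that was evaluated. Recomputing with the same quantization reproduces the forward probabilities exactly.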
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
