Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Peiyuan Zhang; Matthew Noto; Wenxuan Tan; Chengquan Jiang; Will Lin; Wei Zhou; Hao Zhang

arXiv:2603.00040 · cs.LG · March 10, 2026

TL;DR

This paper introduces Attn-QAT, a quantization-aware training (QAT) method for 4-bit (FP4) attention that stabilizes training and preserves model quality, enabling faster inference on FP4-capable GPUs.

Contribution

The paper presents the first systematic study of 4-bit QAT for attention, identifies two principles for stable training, and implements fused Triton kernels for training as well as FP4 inference kernels.

Findings

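01 Attn-QAT recovers FP4 attention quality without outlier heuristics.

02 Attn-QAT achieves up to a 1.5x speedup on an RTX 5090.

03 Two principles stabilize 4-bit attention training: matching low-precision recomputation of attention scores in the backward pass, and resolving implicit precision assumptions in FA's gradient calculation.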

Abstract

Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA's gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT…
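
To make the abstract's first principle concrete, here is a minimal pure-Python sketch (not the paper's Triton kernels). It assumes the common E2M1 FP4 layout, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The names `fake_quant_fp4` and `attention_probs`, the block size of 16, and the max-based per-block scaling are hypothetical choices made for illustration only. The sketch shows how a single outlier flattens the small entries of a block to zero, and how a backward pass that recomputes attention probabilities in high precision sees different values from the ones the FP4 forward pass actually used.

```python
import math

# Magnitudes representable in an E2M1 FP4 format (the sign bit is handled separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(values, block_size=16):
    """Simulate FP4 rounding: scale each block so its max maps to 6.0, snap to the grid, rescale.

    block_size and the max-based scale are illustrative assumptions.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP4_E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            # Nearest representable magnitude for the scaled value.
            mag = min(FP4_E2M1_GRID, key=lambda g: abs(g - abs(v) / scale))
            out.append(math.copysign(mag, v) * scale)
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(q, keys, quantize):
    """Attention probabilities for one query, optionally with fake-quantized Q and K."""
    if quantize:
        q = fake_quant_fp4(q)
        keys = [fake_quant_fp4(k) for k in keys]
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    scores = [inv_sqrt_d * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(scores)


# A heavy-tailed query: one outlier (8.0) dominates the block's scale.
q = [0.1, -0.2, 0.3, 8.0]
keys = [[1.0, -1.0, 1.0, 0.1], [-1.0, 1.0, -1.0, 0.1]]

q_quantized = fake_quant_fp4(q)  # the three small entries collapse to zero

p_forward = attention_probs(q, keys, quantize=True)           # FP4 forward pass
p_recompute_dropin = attention_probs(q, keys, quantize=False)  # high-precision recompute
p_recompute_matched = attention_probs(q, keys, quantize=True)  # low-precision recompute

mismatch_dropin = max(abs(a - b) for a, b in zip(p_forward, p_recompute_dropin))
mismatch_matched = max(abs(a - b) for a, b in zip(p_forward, p_recompute_matched))
print(q_quantized, mismatch_dropin, mismatch_matched)
```

In this toy setting the matched recomputation reproduces the forward probabilities exactly, while the high-precision recomputation does not, so a gradient derived from the latter would not correspond to the function that was evaluated in the forward pass. The sketch does not cover the second principle, which concerns precision assumptions inside FA's gradient calculation.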

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis