SageBwd: A Trainable Low-bit Attention

Jintao Zhang; Marco Chen; Haoxu Wang; Kai Jiang; Ion Stoica; Joseph E. Gonzalez; Jianfei Chen; Jun Zhu

arXiv:2603.02170 · cs.LG · March 3, 2026

TL;DR

SageBwd is a trainable INT8 attention. This work shows that it can match full-precision attention (FPA) during pre-training when QK-norm is used for stability and the number of tokens per step is reduced, and it traces most of the quantization error to the backward-pass score gradient dS.

Contribution

This work investigates why SageBwd previously fell short of full-precision attention during pre-training and, through experiments and theoretical analysis of quantization error and training stability, demonstrates that it can close that gap.

Findings

01  QK-norm is essential for stable training at large tokens per step.

02  Quantization errors mainly come from the backward-pass score gradient dS.

03  Reducing tokens per step helps SageBwd match full-precision performance.
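
As a rough illustration of the first finding, the sketch below shows one common form of QK-norm: normalizing each query and key vector along the head dimension before the dot product. The page does not say which variant the paper uses, so the RMS normalization here, the omission of a learnable gain, and the names `rms_norm` and `attention_scores` are all assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize each vector along the last (head) dimension to unit RMS.
    # Assumption: no learnable gain, which real implementations often add.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention_scores(q, k, use_qk_norm):
    # Pre-softmax attention scores, optionally with QK-norm applied first.
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    d = q.shape[-1]
    return (q @ k.T) / np.sqrt(d)

rng = np.random.default_rng(0)
# Deliberately large activations to mimic drifting Q/K magnitudes.
q = 50.0 * rng.standard_normal((8, 64))
k = 50.0 * rng.standard_normal((8, 64))

raw_max = np.abs(attention_scores(q, k, use_qk_norm=False)).max()
normed_max = np.abs(attention_scores(q, k, use_qk_norm=True)).max()
print(f"max |score| without QK-norm: {raw_max:.1f}")
print(f"max |score| with QK-norm:    {normed_max:.2f}")
```

With unit-RMS vectors, the Cauchy-Schwarz inequality bounds every score by the square root of the head dimension, so the score range no longer depends on how large Q and K have grown. A bounded range is also friendlier to a fixed-width integer format, although this sketch does not attempt to reproduce the paper's stability result.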

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd matches full-precision attention during pretraining. Through experiments and theoretical analysis, we reach a few important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA…
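
To make the terms in the abstract concrete, here is a minimal sketch of how one might measure INT8 quantization error on a forward matmul (Q Kᵀ) and on a backward matmul that consumes the score gradient dS. It uses plain symmetric per-tensor quantization, which is an assumption and not SageBwd's actual scheme; the helpers `quantize_int8`, `int8_matmul`, and `rel_error` are hypothetical names. The dS expression is the standard softmax backward rule, dS = P ∘ (dP − rowsum(dP ∘ P)).

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: x ~= scale * x_q, x_q in [-127, 127].
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

def int8_matmul(a, b):
    # Quantize both operands, accumulate in int32, then dequantize.
    a_q, sa = quantize_int8(a)
    b_q, sb = quantize_int8(b)
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rel_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
n, d = 32, 64
q, k, v, d_out = (rng.standard_normal((n, d)).astype(np.float32) for _ in range(4))

# Forward pass: attention probabilities P = softmax(Q K^T / sqrt(d)).
p = softmax((q @ k.T) / np.sqrt(d))

# Backward pass: dP = dO V^T, then the softmax backward rule gives dS.
d_p = d_out @ v.T
d_s = p * (d_p - (d_p * p).sum(axis=-1, keepdims=True))

# Relative error of an INT8 matmul versus the floating-point reference.
err_forward = rel_error(int8_matmul(q, k.T), q @ k.T)
# dS @ K is dQ up to the 1/sqrt(d) factor, which is omitted here.
err_backward = rel_error(int8_matmul(d_s, k), d_s @ k)
print(f"relative error, Q K^T:  {err_forward:.4f}")
print(f"relative error, dS K:   {err_backward:.4f}")
```

Random Gaussian inputs say nothing about which matmul dominates the error in a real training run; the point is only to show where dS sits in the backward pass and how a per-matmul error comparison could be set up.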

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning