Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training

Seyed Morteza Emadi

arXiv:2602.18851·cs.LG·February 24, 2026

TL;DR

The paper introduces a rank-aware concentration inequality for transformer attention scores, which gives tighter overflow-risk estimates and supports stable low-precision training, particularly in FP8.

Contribution

The paper derives a rank-aware bound on attention score magnitudes and uses it to construct geometry-aware scale factors that prevent overflow in low-precision transformer training.

Findings

01. Tighter concentration bounds for attention scores in low-rank settings

02. Effective overflow prevention in FP8 training across large models

03. Maintains accuracy while eliminating overflows in practical scenarios

Abstract

Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / d_h$ whose maximum magnitude governs overflow risk in low-precision training. We derive a rank-aware concentration inequality: when the interaction matrix $M = W^Q (W^K)^\top$ has rank $r \ll d$, tail probabilities for $\max_{i,j} |S_{ij}|$ decay as $\exp(-d^2 \alpha^2 / (\gamma r))$ rather than $\exp(-d \alpha^2)$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$ to $28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving geometry-aware scale factors that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\| W^Q (W^K)^\top \|_2$ via implicit power iteration, includes a grouped query…
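The abstract mentions estimating the spectral norm of the query-key interaction matrix by implicit power iteration. The sketch below illustrates one way this could look in NumPy; it is not the paper's implementation. It assumes the query and key projections are stored as arrays of shape (d, d_h), and the function names `spectral_norm_qk` and `exponent_ratio` are hypothetical. "Implicit" here means the d-by-d product is never formed: each step applies the two thin factors in sequence. The second helper is plain arithmetic on the two tail exponents quoted in the abstract, with the values of d, r, and γ left to the caller; it is not claimed to reproduce the paper's reported tightening factors.

```python
import numpy as np


def spectral_norm_qk(w_q, w_k, num_iters=100, seed=0):
    """Estimate the largest singular value of M = w_q @ w_k.T without forming M.

    Assumed shapes (not taken from the paper): w_q and w_k are (d, d_h).
    Runs power iteration on M^T M using only thin matrix-vector products.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w_k.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = w_q @ (w_k.T @ v)   # u = M v, cost O(d * d_h)
        w = w_k @ (w_q.T @ u)   # w = M^T u, cost O(d * d_h)
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:
            return 0.0          # M v = 0 or M = 0; nothing more to iterate on
        v = w / norm_w
    # With v close to the top right singular vector, ||M v|| approximates sigma_max.
    return float(np.linalg.norm(w_q @ (w_k.T @ v)))


def exponent_ratio(d, r, gamma):
    """Ratio of the rank-aware exponent d^2 a^2 / (gamma r) to the rank-agnostic d a^2."""
    return d / (gamma * r)
```

Because every step touches only the (d, d_h) factors, the cost per iteration scales with d times d_h instead of d squared. The returned value is a lower estimate of the true norm that tightens as `num_iters` grows, so a practical use would likely add a safety margin before turning it into a scale factor.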

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Model Reduction and Neural Networks