Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
Seyed Morteza Emadi

TL;DR
This paper introduces a rank-aware concentration inequality for attention scores in transformers, enabling more precise overflow risk assessment and stable low-precision training, particularly in FP8.
Contribution
It derives a novel rank-aware bound on attention score magnitudes and applies it to develop geometry-aware scale factors for overflow prevention in low-precision transformer training.
Findings
Tighter concentration bounds for attention scores in low-rank settings
Effective overflow prevention in FP8 training across large models
Maintains accuracy while eliminating overflows in practical scenarios
Abstract
Attention scores in transformers are bilinear forms whose maximum magnitude governs overflow risk in low-precision training. We derive a rank-aware concentration inequality: when the interaction matrix has low rank, tail probabilities for the attention scores decay faster than rank-agnostic bounds predict, at a rate that depends on the rank and on a typicality parameter. For transformer attention, this yields tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving geometry-aware scale factors that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm via implicit power iteration, includes a grouped query…
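The abstract's step of computing a per-layer scale from the spectral norm via implicit power iteration can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's method: the function names (`spectral_norm_qk`, `logit_scale`), the choice of interaction matrix M = W_q W_k^T, and the worst-case bound |x_i^T M x_j| <= ||M||_2 ||x_i|| ||x_j||, which stands in for the rank-aware inequality that the listing does not reproduce. The only external fact used is that 448 is the largest finite value in the FP8 E4M3 format. The usual 1/sqrt(d_head) factor is omitted for brevity.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def matvec(A, v):
    """Return A @ v for a row-major list-of-lists matrix A."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matvec_t(A, v):
    """Return A^T @ v without materialising the transpose."""
    out = [0.0] * len(A[0])
    for row, x in zip(A, v):
        for j, a in enumerate(row):
            out[j] += a * x
    return out


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm_qk(W_q, W_k, iters=100, seed=0):
    """Estimate ||W_q W_k^T||_2 by power iteration on M^T M.

    "Implicit" here means M = W_q W_k^T (d x d) is never formed; each step
    only multiplies by the thin d x r factors.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W_k))]
    n = norm(v)
    v = [x / n for x in v]
    sigma = 0.0
    for _ in range(iters):
        u = matvec(W_q, matvec_t(W_k, v))  # u = M v
        sigma = norm(u)                    # ||M v|| with ||v|| = 1
        if sigma == 0.0:
            return 0.0
        w = matvec(W_k, matvec_t(W_q, u))  # w = M^T u
        n = norm(w)
        if n == 0.0:
            return 0.0
        v = [x / n for x in w]
    return sigma


def logit_scale(sigma, x_norm_bound, fmt_max=FP8_E4M3_MAX):
    """Divisor that keeps |logit| <= fmt_max under the worst-case bound.

    Assumes every token embedding satisfies ||x|| <= x_norm_bound
    (a hypothetical, caller-supplied bound).
    """
    return sigma * x_norm_bound ** 2 / fmt_max
```

Dividing a layer's logits by `logit_scale(...)` would keep them within the E4M3 range whenever the assumed norm bound holds. Because this sketch uses the deterministic worst case, it is expected to be more conservative than the rank-aware, probabilistic scale the abstract describes.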
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Model Reduction and Neural Networks
