Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
Jinwoo Baek

TL;DR
This paper develops a layer-wise theoretical framework to predict, explain, and mitigate instability in low-precision transformer models, providing practical diagnostics and stabilization techniques.
Contribution
It introduces a first-order, module-wise theory for transformer instability, including new bounds, diagnostics, and a stabilization method based on LayerNorm adjustments.
Findings
The combined predictor accurately tracks precision mismatches across seeds and widths.
The maximum softmax sensitivity acts as an early warning for error spikes.
A small LayerNorm tweak effectively stabilizes models with minimal overhead.
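As a rough illustration of the second finding, the sketch below computes one plausible notion of rowwise softmax sensitivity and takes its maximum over rows. The summary does not give the paper's exact definition, so the choice here (the spectral norm of the softmax Jacobian diag(p) - p p^T for each attention row) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def rowwise_softmax_sensitivity(scores):
    # ASSUMPTION: sensitivity of a row is the spectral norm of the
    # softmax Jacobian diag(p) - p p^T; the paper may define it differently.
    p = softmax(np.asarray(scores, dtype=np.float64))
    out = np.empty(p.shape[0])
    for i, row in enumerate(p):
        jac = np.diag(row) - np.outer(row, row)
        out[i] = np.linalg.norm(jac, 2)
    return out

# Hypothetical attention scores: 8 query rows, 16 keys.
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 16))
sens = rowwise_softmax_sensitivity(scores)
max_sens = sens.max()  # the quantity one might log per layer during training
```

Under this definition the value lies between 0 (a nearly one-hot row) and 1/2 (mass split evenly between two keys), so a logged maximum that drifts toward 1/2 would mark rows where small score perturbations move the attention weights the most.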
Abstract
Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio , a rowwise softmax sensitivity , and value conditioning . We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator with a matching first-order bound in the -dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined predictor $\kappa_{\rm softmax},(1+\kappa_{\rm…
