Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability

Jinwoo Baek

arXiv:2510.21770 · cs.LG · October 28, 2025

TL;DR

This paper develops a layer-wise theoretical framework to predict, explain, and mitigate instability in low-precision transformer models, providing practical diagnostics and stabilization techniques.

Contribution

The paper introduces a first-order, module-wise theory of transformer instability, with new bounds, diagnostics, and a stabilization method based on a LayerNorm adjustment.

Findings

01. The combined predictor accurately tracks precision mismatches across seeds and widths.

02. The maximum softmax sensitivity acts as an early warning for error spikes.

03. A small LayerNorm tweak effectively stabilizes models with minimal overhead.

Abstract

Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$. We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound in the $\epsilon$-dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined predictor $\kappa_{\rm softmax},(1+\kappa_{\rm…
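To make the idea of a rowwise softmax sensitivity concrete, here is a minimal sketch in plain Python. It is not the paper's definition of $\kappa_{\rm softmax}$, which is not given on this page. As an assumption, it uses the infinity norm of the softmax Jacobian $J = \mathrm{diag}(p) - pp^\top$ as a stand-in; for a probability row $p$ this norm equals $\max_i 2p_i(1-p_i)$. The helper names (`to_fp16`, `rowwise_sensitivity`, `fp16_mismatch`) and the toy score rows are hypothetical. The sketch also rounds attention scores to half precision and measures how far the softmax output moves, which illustrates forward-error amplification at first order.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def softmax(row):
    """Numerically stable softmax of one row of attention scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian_inf_norm(p):
    """Infinity norm of J = diag(p) - p p^T.

    Row i of J has absolute sum p_i(1 - p_i) + p_i * sum_{j != i} p_j,
    which simplifies to 2 * p_i * (1 - p_i).
    ASSUMPTION: used here as a proxy for a rowwise softmax sensitivity.
    """
    return max(2.0 * pi * (1.0 - pi) for pi in p)


def rowwise_sensitivity(scores):
    """Proxy sensitivity for every row of a score matrix."""
    return [softmax_jacobian_inf_norm(softmax(row)) for row in scores]


def fp16_mismatch(scores):
    """Per-row max abs difference between softmax(scores) and
    softmax(scores rounded to fp16)."""
    out = []
    for row in scores:
        p = softmax(row)
        q = softmax([to_fp16(s) for s in row])
        out.append(max(abs(a - b) for a, b in zip(p, q)))
    return out


# Hypothetical toy rows, chosen only for illustration.
scores = [
    [0.1, 0.2, 0.3, 0.4],    # small scores, near-uniform attention
    [30.1, 30.2, 0.3, 0.4],  # two large, closely competing scores
    [50.0, 0.2, 0.3, 0.4],   # one dominant score, saturated attention
]

sens = rowwise_sensitivity(scores)
err = fp16_mismatch(scores)
for s, e in zip(sens, err):
    print(f"sensitivity={s:.4f}  fp16_mismatch={e:.2e}")
```

In this toy setup the second row shows the largest mismatch: its scores are large, so half-precision rounding perturbs them more, and two entries compete closely, so the proxy sensitivity is near its maximum of 0.5. The saturated third row has near-zero sensitivity and is essentially unaffected. A tracked maximum of such a per-row quantity is the kind of signal that could be logged during training, though the paper's actual estimator may differ.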
