Exact Attention Sensitivity and the Geometry of Transformer Stability
Seyed Morteza Emadi

TL;DR
This paper develops a stability theory for transformers, deriving exact operator norms and geometric bounds to explain training stability and the roles of normalization techniques, validated on large models.
Contribution
It introduces a first-principles framework for understanding transformer stability, derives exact norms, and explains the effects of normalization and depth on gradients.
Findings
Pre-LN preserves identity gradient paths.
DeepNorm's scaling emerges from attention matrix structure.
Attention sensitivity remains high throughout training.
Abstract
Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the exact operator norm of the softmax Jacobian, in which a balanced-mass factor quantifies attention sensitivity. (2) We introduce a block-/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's scaling emerges from the quartic structure of attention's four projection matrices. We…
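As a rough illustration of the object the abstract refers to, the sketch below builds the softmax Jacobian, diag(p) - p pᵀ scaled by 1/τ, and measures how its size depends on how balanced the attention distribution is. The choice of the spectral norm, the temperature parameter `tau`, and all function names are assumptions made for this example; the excerpt does not state which operator norm the paper uses or how the balanced-mass factor is defined.

```python
import numpy as np

def softmax(u, tau=1.0):
    """Numerically stable softmax of logits u at temperature tau."""
    z = (u - np.max(u)) / tau
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(u, tau=1.0):
    """Jacobian of softmax(u / tau) with respect to u: (diag(p) - p p^T) / tau."""
    p = softmax(u, tau)
    return (np.diag(p) - np.outer(p, p)) / tau

def spectral_norm(J):
    """Largest singular value; an assumed stand-in for the paper's norm."""
    return np.linalg.norm(J, 2)

# Two equal logits: attention mass is split evenly between two tokens.
balanced = np.zeros(2)
# One dominant logit: attention is almost one-hot.
peaked = np.array([10.0, 0.0, 0.0, 0.0])

norm_balanced = spectral_norm(softmax_jacobian(balanced))
norm_peaked = spectral_norm(softmax_jacobian(peaked))

print("balanced:", norm_balanced)
print("peaked:  ", norm_peaked)
```

Under these assumptions, the evenly split distribution gives the larger norm and the nearly one-hot distribution gives a norm close to zero, which matches the intuition that sensitivity is highest when attention mass is balanced.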
Taxonomy
Topics: Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science
