Exact Attention Sensitivity and the Geometry of Transformer Stability

Seyed Morteza Emadi

arXiv:2602.18849·cs.LG·February 24, 2026

Open Access

TL;DR

This paper develops a stability theory for transformers, deriving exact operator norms and geometric bounds to explain training stability and the roles of normalization techniques, validated on large models.

Contribution

It introduces a first-principles framework for transformer stability that derives the exact operator norm of the softmax Jacobian and explains how normalization and depth affect gradients.

Findings

01. Pre-LN preserves identity gradient paths.

02. DeepNorm's scaling emerges from attention matrix structure.

03. Attention sensitivity remains high throughout training.
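As a rough illustration of the first finding, the usual textbook forms of the two residual layouts can be written side by side. The symbols $F_\ell$ (the attention or MLP sublayer at depth $\ell$), $J_{F_\ell}$, and $J_{\mathrm{LN}}$ (the corresponding Jacobians) are notation assumed here, not taken from the paper.

$$
\text{Pre-LN:}\quad x_{\ell+1} = x_\ell + F_\ell(\mathrm{LN}(x_\ell)),
\qquad
\frac{\partial x_{\ell+1}}{\partial x_\ell} = I + J_{F_\ell}\, J_{\mathrm{LN}}
$$

$$
\text{Post-LN:}\quad x_{\ell+1} = \mathrm{LN}(x_\ell + F_\ell(x_\ell)),
\qquad
\frac{\partial x_{\ell+1}}{\partial x_\ell} = J_{\mathrm{LN}}\,(I + J_{F_\ell})
$$

Multiplying the pre-LN Jacobians over $N$ layers always leaves a bare identity term in the expansion, whereas every term of the post-LN product carries $N$ LayerNorm Jacobian factors, which is one way to read the depth-dependent compounding described in the abstract.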

Abstract

Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the \emph{exact} operator norm of the softmax Jacobian, $\| J_{\mathrm{softmax}}(u/\tau) \|_{\infty \to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p) \in [0, 1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We…
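The first pillar can be checked numerically on a small example. The sketch below is not the paper's code: it brute-forces the $\infty \to 1$ norm of the standard softmax Jacobian $(\mathrm{diag}(p) - pp^\top)/\tau$ and compares it with a candidate closed form $\theta(p) = \max_S 4\,P(S)\,(1 - P(S))$, where $P(S)$ is the attention mass on a subset $S$ of tokens. The abstract excerpt does not define $\theta(p)$, so this closed form is an assumption (it follows from the Jacobian being positive semidefinite), and all function names are hypothetical.

```python
import itertools
import math

def softmax(u, tau=1.0):
    """Temperature-scaled softmax of a list of logits."""
    m = max(u)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / tau) for x in u]
    z = sum(exps)
    return [e / z for e in exps]

def jacobian(p, tau=1.0):
    """Jacobian of softmax(u / tau) with respect to u: (diag(p) - p p^T) / tau."""
    n = len(p)
    return [[((p[i] if i == j else 0.0) - p[i] * p[j]) / tau for j in range(n)]
            for i in range(n)]

def norm_inf_to_1(J):
    """Brute-force max of ||J x||_1 over sign vectors x (exponential in n)."""
    n = len(J)
    best = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=n):
        val = sum(abs(sum(J[i][j] * x[j] for j in range(n))) for i in range(n))
        best = max(best, val)
    return best

def theta_assumed(p):
    """ASSUMED closed form: max over subsets S of 4 * P(S) * (1 - P(S))."""
    best = 0.0
    for mask in itertools.product((0, 1), repeat=len(p)):
        mass = sum(pi for pi, b in zip(p, mask) if b)
        best = max(best, 4.0 * mass * (1.0 - mass))
    return best

tau = 0.5
p = softmax([2.0, 0.5, -1.0, 0.0], tau)
lhs = norm_inf_to_1(jacobian(p, tau))   # brute-force operator norm
rhs = theta_assumed(p) / tau            # assumed theta(p) / tau
print(lhs, rhs)
```

Under this assumed form, $\theta(p)$ reaches 1 exactly when the attention mass can be split into two halves of weight 1/2, and it shrinks toward 0 as attention concentrates on a single token. Both loops are exponential in the number of tokens, so the sketch is only meant for very short sequences.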

Taxonomy

Topics: Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science