TL;DR
This paper develops an analytical theory for initialising deep transformers. It identifies two failure modes, rank collapse and entropy collapse, and provides a method for choosing initialisation hyper-parameters that ensure trainability.
Contribution
It offers a unified theoretical framework for understanding and avoiding failure modes in transformer initialisation, including a practical algorithm for selecting initialisation hyper-parameters.
Findings
Identifies two failure modes: rank collapse and entropy collapse.
Provides a simple algorithm for computing trainability diagrams.
Quantitatively predicts initialization scales for stable training.
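Rank collapse, the first failure mode listed above, can be illustrated with a toy experiment. The sketch below is not the paper's algorithm. It assumes an attention-only stack (no skip connections, no MLP), Gaussian weights with variance 1/d, and a plain rescaling of each token to norm sqrt(d) in place of layer normalisation. The function names and the parameter `sigma_a` (used here as the approximate standard deviation of the attention scores) are illustrative choices.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax with max-subtraction for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def mean_cosine_similarity(X):
    """Average cosine similarity over all pairs of distinct tokens (rows)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = Xn @ Xn.T
    T = X.shape[0]
    return (G.sum() - np.trace(G)) / (T * (T - 1))

def attention_only_similarity(T=64, d=128, depth=6, sigma_a=0.1, seed=0):
    """Track token similarity through a toy stack of attention-only layers.

    Assumed toy setup (not the paper's): no skip connections, no MLP,
    and rows rescaled to norm sqrt(d) instead of layer normalisation.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))
    sims = [mean_cosine_similarity(X)]
    for _ in range(depth):
        W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        # With this scaling each score has standard deviation of roughly sigma_a.
        scores = sigma_a * (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
        X = softmax_rows(scores) @ X @ W_v
        X = np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)
        sims.append(mean_cosine_similarity(X))
    return sims

sims = attention_only_similarity()
print([round(s, 3) for s in sims])
```

With a small `sigma_a` the attention weights are close to uniform, so every token is mapped to nearly the same average and the mean cosine similarity moves from about 0 towards 1 within a few layers.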
Abstract
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We…
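The entropy collapse described in the abstract can be seen numerically in a single attention layer. The following is a minimal sketch under stated assumptions, not the paper's procedure: random Gaussian tokens, Gaussian query and key weights with variance 1/d, and a hypothetical scale `sigma_a` multiplying roughly unit-variance scores. It reports the mean Shannon entropy of the attention rows, whose maximum is log T for uniform attention.

```python
import numpy as np

def attention_entropy(scores):
    """Mean Shannon entropy (in nats) of the row-wise softmax of `scores`."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(shifted)
    P /= P.sum(axis=1, keepdims=True)
    # Clipping only guards log(0); entries with P == 0 contribute exactly 0.
    return float(-(P * np.log(np.clip(P, 1e-300, None))).sum(axis=1).mean())

def entropy_vs_scale(sigmas, T=256, d=128, seed=0):
    """Attention entropy at initialisation for several score scales.

    Assumed toy setup: the same random scores are reused and only rescaled,
    so differences come from the scale `sigma_a` alone.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d)) / np.sqrt(d)
    base = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)  # roughly unit-variance scores
    return {s: attention_entropy(s * base) for s in sigmas}

T = 256
ent = entropy_vs_scale([0.1, 1.0, 10.0], T=T)
for s, h in ent.items():
    print(f"sigma_a={s:5.1f}  entropy={h:.3f}  (max = log T = {np.log(T):.3f})")
```

Small scales leave the entropy near log T, while large scales concentrate each row on a few tokens and drive the entropy down, which is the regime the abstract associates with training instability.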
Peer Reviews
Decision: ICLR 2026 Poster
The paper's strengths lie in its rigorous theoretical approach to a critical practical problem. Strong Theoretical Analysis: The paper provides a comprehensive and asymptotically exact analytical framework. It goes beyond simple heuristics to derive precise update equations for token similarity (Result 1) and an exact expression for the gradient norm at initialization (Result 2). This allows it to analyze both forward signal propagation and backward gradient flow, providing a complete picture o
My main concern is the lack of convincing experiments supporting the framework. While it is totally fine that the paper's focus is theoretical, it's a shame that there is no better evidence to support what is ultimately a very important practical problem - specifically when strong evidence would not be so difficult to provide. The core results are derived in the "limit of infinite sequence length." The authors acknowledge this creates "finite-size effects" and a "discrepancy between theory and s
1. Unifying view of two failure modes of attention is novel and valuable for the signal propagation community. I think that working under less strict assumptions than previous work (uniform attention, separate treatment of numerator and denominator in attention) is valuable. 2. Using ERM in the context of signal propagation analysis is novel and original. 3. The part of the paper that focuses on a single self-attention layer is good quality and well presented. It’s commendable that the authors e
I’m open to increasing my score if the authors address the following weaknesses: 1. The analysis of the average cosine similarity while passing through the full Transformer block (Section 4) is rushed. I believe some other parts could be shortened (like the setup in Section 2) in order to make space for a more thorough analysis of the full Transformer block. The result from eq. 14 is not interpreted and the authors do not clearly walk the reader through how they arrive at Algorithm 1 from it. I bel
- The paper provides a precise asymptotic analysis on the evolution of the average cosine similarity and the average IRP with respect to the initialization scaling.
- There are many unclear arguments.
  - Why is the variance of the attention scores $\sigma_a^2$? (Doesn't it depend on the variance of $X_t$?)
  - Sec. B.1, L666 says $\sigma_Q^2=\sigma_K^2=\sigma_a^{\color{red}2}/d$, which differs from eq. (8). Also, the conclusion in (15), $Cov(a_{ts}a_{\tau\sigma})=\sigma_a^2q_{ts}q_{s\sigma}$, is $d^2$ times larger than eq. (7).
  - Why should the attention score variance be $O(\log T)$?
  - The asymptotic prediction (Result 1) does not pe
Taxonomy
Topics: Integrated Circuits and Semiconductor Failure Analysis · Ultrasonics and Acoustic Wave Propagation · Advancements in Semiconductor Devices and Circuit Design
Methods: Softmax · Attention Is All You Need · Residual Connection
