Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
Giansalvo Cirrincione

TL;DR
This paper provides a detailed analysis of representational collapse in Transformers, revealing the roles of layer normalization, residual connections, and the MLP, and introduces a symmetry-breaking framework to unify various collapse phenomena.
Contribution
It clarifies the effects of layer normalization and residual connections on rank preservation, highlights the unique role of the MLP, and introduces a position-gated output projection to mitigate symmetry-related issues.
Findings
Layer normalization preserves affine rank exactly.
Residual connections prevent rank collapse in real Transformers.
Head-channel non-identifiability remains unresolved by the MLP.
Abstract
A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong et al., is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation (LN) is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating…
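The contrast the abstract draws between attention-only stacks and stacks with skip connections can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's construction: it uses single-head attention with random Gaussian weights, no layer normalisation and no MLP, a hypothetical value-weight scale of 0.1, and the Frobenius distance to the token mean as a simplified stand-in for the rank-collapse residual. All function and variable names are illustrative.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def relative_residual(x):
    """||x - 1 mean(x)^T||_F / ||x||_F; 0 means all tokens are identical."""
    centered = x - x.mean(axis=0, keepdims=True)
    return np.linalg.norm(centered) / np.linalg.norm(x)


def run_stack(x, layers, skip):
    """Apply single-head attention layers; return the residual after each layer."""
    trace = [relative_residual(x)]
    d_model = x.shape[1]
    for w_q, w_k, w_v in layers:
        attn = softmax_rows((x @ w_q) @ (x @ w_k).T / np.sqrt(d_model))
        update = attn @ x @ w_v
        # skip=False is the attention-only regime; skip=True adds the residual path.
        x = x + update if skip else update
        trace.append(relative_residual(x))
    return trace


rng = np.random.default_rng(0)
n_tokens, d_model, depth = 8, 16, 12  # illustrative sizes, not taken from the paper
value_scale = 0.1                     # assumption: small value weights

x0 = rng.standard_normal((n_tokens, d_model))
layers = [
    (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model),
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model),
        value_scale * rng.standard_normal((d_model, d_model)) / np.sqrt(d_model),
    )
    for _ in range(depth)
]

pure_trace = run_stack(x0, layers, skip=False)
skip_trace = run_stack(x0, layers, skip=True)

print("attention only, final residual:", pure_trace[-1])
print("with skip connections, final residual:", skip_trace[-1])
```

In this toy setting the attention-only residual should fall by many orders of magnitude over the 12 layers, while the residual with skip connections stays at a comparable order to its starting value. The outcome depends on the chosen weight scales, so it is an illustration of the qualitative picture rather than evidence for the measure-theoretic statement.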
