Transformer Normalisation Layers and the Independence of Semantic Subspaces
Stephen Menary, Samuel Kaski, Andre Freitas

TL;DR
This paper analyzes how the placement of normalization layers in transformers affects the independence of semantic subspaces. It shows that Pre-Norm can cause interference between subspaces and circuit collapse, while QKV-Norm relaxes the representational constraints but may perform worse out-of-distribution.
Contribution
It provides a theoretical and empirical comparison of normalization strategies in transformers, showing that Pre-Norm violates subspace independence unless the model learns a strict representation structure, and proposing QKV-Norm as an alternative.
Findings
Pre-Norm causes interference among subspaces due to shared normalization.
Circuit collapse occurs when attention shifts to different tokens, especially under perturbations.
QKV-Norm relaxes representational constraints but may perform worse out-of-distribution.
Abstract
Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the normalisation layer placement used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the L2-norms of the query/key/value…
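The interference described in the abstract can be illustrated with a toy example. The sketch below is not the paper's experiment. It assumes a 4-dimensional representation split into a hypothetical subspace A (coordinates 0-1) and subspace B (coordinates 2-3), identity query/key projections that read only subspace A, and RMS normalisation without a learned gain as a stand-in for the Pre-Norm layer. All function and variable names are illustrative.

```python
import math


def rms_norm(x):
    """Rescale x to unit root-mean-square (RMSNorm with no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_from_subspace_a(query, keys, pre_norm):
    """Attention weights when queries and keys read only coordinates 0-1 (subspace A)."""

    def read_a(x):
        if pre_norm:
            x = rms_norm(x)  # the norm is computed over A and B together
        return x[:2]

    q = read_a(query)
    logits = [sum(qi * ki for qi, ki in zip(q, read_a(k))) for k in keys]
    return softmax(logits)


def make_keys(b_scale):
    """Two keys with fixed A components; only key 0's B component changes."""
    return [[2.0, 0.0, b_scale, 0.0], [1.5, 0.0, 1.0, 0.0]]


QUERY = [2.0, 0.0, 1.0, 0.0]

for pre_norm in (False, True):
    for b_scale in (1.0, 10.0):
        weights = attention_from_subspace_a(QUERY, make_keys(b_scale), pre_norm)
        print(f"pre_norm={pre_norm} b_scale={b_scale} weights={weights}")
```

Without normalisation, the attention weights are identical for both values of `b_scale`, because subspace B is never read. With the shared normalisation, enlarging key 0's B component shrinks its normalised A component, and in this toy setting the larger attention weight moves from key 0 to key 1 even though nothing in subspace A changed.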
Taxonomy
Topics: Constraint Satisfaction and Optimization · Neural Networks and Applications
Methods: Softmax · Attention Is All You Need
