Transformer Normalisation Layers and the Independence of Semantic Subspaces

Stephen Menary; Samuel Kaski; Andre Freitas

arXiv:2406.17837·cs.LG·June 27, 2024

Open Access

TL;DR

This paper analyzes how the placement of normalization layers in transformers affects the independence of semantic subspaces. It finds that Pre-Norm can cause subspaces to interfere and circuits to collapse, while QKV-Norm relaxes the representational constraints but may perform worse out-of-distribution.

Contribution

It provides a theoretical and empirical comparison of normalization strategies in transformers, highlighting how Pre-Norm violates subspace independence and proposing QKV-Norm as an alternative.

Findings

01. Pre-Norm causes interference among subspaces due to shared normalization.

02. Circuit collapse occurs when attention shifts to different tokens, especially under perturbations.

03. QKV-Norm relaxes representational constraints but may perform worse out-of-distribution.
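
The circuit-collapse finding can be illustrated with a toy attention head. The sketch below is not the paper's experiment: the query, the keys, and the 10% perturbation are hypothetical values chosen so that a small change in one key's norm moves most of the attention weight to a different token.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical query and keys: tokens 0 and 1 point in the same direction
# as the query, with token 0 having a slightly larger norm.
q = np.array([10.0, 0.0])
keys = np.array([
    [5.0, 0.0],   # token 0: logit 50.0
    [4.8, 0.0],   # token 1: logit 48.0
    [0.0, 5.0],   # token 2: logit 0.0
])

def attention(q, keys, key_scale):
    """Attention weights after rescaling each key's L2 norm by key_scale."""
    return softmax((keys * key_scale[:, None]) @ q)

# Unperturbed norms: attention concentrates on token 0.
clean = attention(q, keys, np.ones(3))

# Token 1's key norm grows by 10%: its logit overtakes token 0's,
# so the dominant attention target changes.
perturbed = attention(q, keys, np.array([1.0, 1.1, 1.0]))

print("clean attention:    ", np.round(clean, 3))
print("perturbed attention:", np.round(perturbed, 3))
```

Because the softmax is sharply peaked, the change is close to all-or-nothing: the head attends almost entirely to one token before the perturbation and almost entirely to another after it.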

Abstract

Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the placement of normalisation layer used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the $L_{2}$-norms of the query/key/value…
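
The interference through a common normalisation factor can be seen in a small numerical sketch. Everything here is an illustrative assumption rather than the paper's code: the latent vector is split into two hypothetical subspaces A and B, the query projection `W_q` reads only subspace A, normalisation is a gain-free L2 normalisation, and QKV-Norm is taken to mean normalising after the query/key/value projection.

```python
import numpy as np

def l2_normalise(v, eps=1e-12):
    """Scale v to unit L2 norm (gain-free stand-in for a normalisation layer)."""
    return v / (np.linalg.norm(v) + eps)

rng = np.random.default_rng(0)
dim_a, dim_b = 4, 4

# Hypothetical query projection that reads only subspace A
# (the first dim_a coordinates) and ignores subspace B.
W_q = np.hstack([rng.normal(size=(3, dim_a)), np.zeros((3, dim_b))])

a = rng.normal(size=dim_a)  # component in subspace A
b = rng.normal(size=dim_b)  # component in subspace B

def pre_norm_query(a, b):
    """Normalise the full latent vector first, then project."""
    return W_q @ l2_normalise(np.concatenate([a, b]))

def qkv_norm_query(a, b):
    """Project first, then normalise the query (assumed QKV-Norm placement)."""
    return l2_normalise(W_q @ np.concatenate([a, b]))

# Change only subspace B by scaling it up tenfold.
q_pre_small, q_pre_large = pre_norm_query(a, b), pre_norm_query(a, 10.0 * b)
q_qkv_small, q_qkv_large = qkv_norm_query(a, b), qkv_norm_query(a, 10.0 * b)

# Pre-Norm: the query norm shrinks even though W_q never reads B.
pre_norm_ratio = np.linalg.norm(q_pre_large) / np.linalg.norm(q_pre_small)
# Normalising after projection: the query is unchanged.
qkv_norm_shift = np.linalg.norm(q_qkv_large - q_qkv_small)

print("Pre-Norm query-norm ratio:", pre_norm_ratio)
print("QKV-Norm query shift:     ", qkv_norm_shift)
```

Under these assumptions, the Pre-Norm query keeps its direction but changes its norm whenever subspace B changes, which rescales the attention logits; this is the kind of norm perturbation the abstract models as noise.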

Taxonomy

Topics: Constraint Satisfaction and Optimization · Neural Networks and Applications

Methods: Softmax · Attention Is All You Need