TL;DR
This paper analyzes how various normalization schemes in deep transformers influence token representation dynamics, clustering, and collapse, providing a unified framework that clarifies their effects and identifies Peri-LN as especially effective.
Contribution
It models token representations as interacting particles on the sphere, which gives a single framework for analyzing how different normalization schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT) affect them.
Findings
Normalization acts as speed regulation in token evolution.
Peri-LN is identified as a particularly effective normalization scheme.
The framework explains how normalization influences clustering and collapse in representations.
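For readers unfamiliar with the schemes being compared, the sketch below shows where the normalization sits in a residual block for three of them, as they are commonly described in the literature. It is not taken from the paper: `layer_norm` omits the learned gain and bias, and the sublayer `f` is a hypothetical stand-in for attention or an MLP.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-token normalization to zero mean and unit variance (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, f):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + f(x))

def pre_ln(x, f):
    # Pre-LN: normalize only the sublayer input; the residual stream is left unnormalized.
    return x + f(layer_norm(x))

def peri_ln(x, f):
    # Peri-LN: normalize both the input and the output of the sublayer.
    return x + layer_norm(f(layer_norm(x)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8)) / np.sqrt(8)
    f = lambda h: np.tanh(h @ W)  # hypothetical sublayer, for illustration only
    x = rng.normal(size=(4, 8))   # 4 tokens, 8 dimensions
    for name, block in [("post_ln", post_ln), ("pre_ln", pre_ln), ("peri_ln", peri_ln)]:
        y = block(x, f)
        print(name, np.linalg.norm(y, axis=-1).round(3))
```

Comparing the printed token norms gives a rough feel for how each placement constrains the size of the update relative to the residual stream.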
Abstract
We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes -- including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT -- revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
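The particle picture in the abstract can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the paper's model: it uses a generic softmax-attention interaction with identity query, key, and value matrices, a forward-Euler step followed by projection back to the unit sphere, and a hypothetical `speed` function standing in for the idea that normalization regulates how fast tokens move. The names `attention_velocity`, `simulate`, and `mean_cosine` are illustrative.

```python
import numpy as np

def normalize(x):
    # Project each row onto the unit sphere.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_velocity(x, beta):
    # Softmax attention with identity Q, K, V (an assumption made for simplicity).
    scores = beta * (x @ x.T)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    v = w @ x
    # Keep only the component tangent to the sphere at each token.
    return v - np.sum(v * x, axis=1, keepdims=True) * x

def simulate(x0, beta=1.0, dt=0.1, steps=200, speed=lambda t: 1.0):
    # Forward Euler on the sphere; speed(t) is a hypothetical scalar
    # multiplier that rescales how fast every token moves at time t.
    x = normalize(x0)
    for k in range(steps):
        x = normalize(x + dt * speed(k * dt) * attention_velocity(x, beta))
    return x

def mean_cosine(x):
    # Average pairwise cosine similarity; values near 1 indicate collapse to one cluster.
    n = x.shape[0]
    g = x @ x.T
    return (g.sum() - np.trace(g)) / (n * (n - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = normalize(rng.normal(size=(8, 3)))  # 8 tokens on the sphere in R^3
    for label, speed in [("constant speed", lambda t: 1.0),
                         ("decaying speed", lambda t: 1.0 / (1.0 + t))]:
        xT = simulate(x0, speed=speed)
        print(label, "mean cosine:", round(float(mean_cosine(xT)), 4))
```

Swapping in different `speed` functions changes how quickly the tokens cluster over the same number of layers, which is the qualitative effect the speed-regulation viewpoint describes. The specific speed profiles each scheme induces are derived in the paper and are not reproduced here.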