Normalization in Attention Dynamics

Nikita Karagodin; Shu Ge; Yury Polyanskiy; Philippe Rigollet

arXiv:2510.22026·cs.LG·November 12, 2025

TL;DR

This paper analyzes how various normalization schemes in deep transformers influence token representation dynamics, clustering, and collapse, providing a unified framework that clarifies their effects and identifies Peri-LN as especially effective.

Contribution

It introduces a particle-based model to unify and analyze the impact of different normalization schemes on transformer token representations.

Findings

01. Normalization acts as speed regulation in token evolution.

02. Peri-LN is identified as a particularly effective normalization scheme.

03. The framework explains how normalization influences clustering and collapse in representations.

Abstract

We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes -- including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT -- revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
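
As a rough illustration of the particle picture described in the abstract, the sketch below treats tokens as points on the unit sphere that move under softmax self-attention, with a scalar `speed` factor standing in for the speed-regulation role of normalization. Everything specific here is an assumption made for illustration and is not taken from the paper: the explicit Euler update, identity query/key/value matrices, the parameters `beta` and `dt`, and all function names. In particular, it does not reproduce the speed law of any named scheme (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT).

```python
import numpy as np


def project_tangent(x, v):
    """Project each row of v onto the tangent space of the unit sphere
    at the corresponding row of x (removes the radial component)."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def attention_step(x, beta=1.0, dt=0.1, speed=1.0):
    """One explicit Euler step of softmax attention dynamics on the sphere.

    x     : (n, d) array of unit-norm token representations.
    beta  : inverse temperature of the softmax (assumed parameter).
    dt    : step size (assumed parameter).
    speed : hypothetical scalar rescaling the velocity; a stand-in for
            the speed regulation attributed to normalization.
    """
    logits = beta * (x @ x.T)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    velocity = project_tangent(x, weights @ x)
    x_new = x + dt * speed * velocity
    # Retract back onto the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def min_pairwise_cosine(x):
    """Smallest cosine similarity between any two tokens; it equals 1
    only when all tokens coincide (full collapse)."""
    return float((x @ x.T).min())


def simulate(n=16, d=3, steps=200, speed=1.0, seed=0):
    """Run the toy dynamics from random initial tokens and return the
    minimum pairwise cosine before and after, plus the final tokens."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    start = min_pairwise_cosine(x)
    for _ in range(steps):
        x = attention_step(x, speed=speed)
    return start, min_pairwise_cosine(x), x
```

Calling `simulate(speed=0.5)` and `simulate(speed=2.0)` and comparing the returned cosines gives a feel for how a change in speed alone shifts how far the tokens have clustered after a fixed number of layers (steps) in this toy setting.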

Videos

Normalization in Attention Dynamics (SlidesLive)