Clustering in Deep Stochastic Transformers
Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière

TL;DR
This paper analyzes how stochastic initialization noise in deep Transformers prevents token collapse into a single point, revealing complex dynamics and phase transitions that influence model behavior and accuracy.
Contribution
It introduces a stochastic analysis of deep Transformers showing how initialization noise affects token clustering and reveals phase transitions in token configurations.
Findings
Initialization noise prevents token collapse into a single cluster.
A phase transition exists where antipodal configurations become stable.
Suppressing noise reduces model accuracy.
Abstract
Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a common…
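To make the setting in the abstract concrete, the following is a minimal NumPy sketch of a toy model of this kind, not the authors' construction. Everything specific in it is an assumption made for illustration: the function names, the softmax attention with inverse temperature `beta`, the Gaussian value matrix with variance `1/dim` redrawn at every layer, the `sqrt(1 / n_layers)` step standing in for diffusion scaling, and the projection onto the unit sphere standing in for token-wise RMS normalization (which rescales to norm `sqrt(dim)` rather than 1).

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def project_to_sphere(tokens):
    """Rescale every token (row) to unit Euclidean norm."""
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)


def simulate_tokens(n_tokens=8, dim=3, n_layers=500, beta=1.0, seed=0):
    """Run a toy deep attention stack with random value matrices.

    Hypothetical model, for illustration only: each layer draws a fresh
    Gaussian value matrix, applies it to the attention-averaged tokens,
    takes a step of size sqrt(1 / n_layers), and renormalizes each token.
    """
    rng = np.random.default_rng(seed)
    tokens = project_to_sphere(rng.standard_normal((n_tokens, dim)))
    step = np.sqrt(1.0 / n_layers)  # assumed form of diffusion scaling
    for _ in range(n_layers):
        attn = softmax_rows(beta * tokens @ tokens.T)
        # One value matrix per layer, applied to every token in that layer.
        value = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        tokens = project_to_sphere(tokens + step * (attn @ tokens) @ value.T)
    return tokens


if __name__ == "__main__":
    final_tokens = simulate_tokens()
    # Pairwise cosine similarities: values near 1 indicate clustered tokens,
    # values near -1 indicate antipodal pairs.
    print(np.round(final_tokens @ final_tokens.T, 2))
```

Inspecting the Gram matrix for different seeds, depths, and values of `beta` gives a rough feel for whether tokens in this toy model collapse, spread out, or pair up antipodally; it does not reproduce the paper's theorems or experiments.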
Taxonomy
Topics: Quantum many-body systems · Stochastic Gradient Optimization Techniques · Quantum chaos and dynamical systems
