Clustering in Deep Stochastic Transformers

Lev Fedorov; Michaël E. Sander; Romuald Elie; Pierre Marion; Mathieu Laurière

arXiv:2601.21942 · stat.ML · January 30, 2026

Open Access

TL;DR

This paper analyzes how the noise from random initialization in deep Transformers prevents tokens from collapsing to a single point. It identifies a phase transition in the resulting token configurations and reports that suppressing this noise reduces model accuracy.

Contribution

It introduces a stochastic analysis of deep Transformers showing how initialization noise affects token clustering and reveals phase transitions in token configurations.

Findings

01. Initialization noise prevents token collapse into a single cluster.

02. A phase transition exists where antipodal configurations become stable.

03. Suppressing noise reduces model accuracy.

Abstract

Transformers have revolutionized deep learning across various domains, but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common}…
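The discrete setting described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' model: the residual update form, the identity query/key matrices, the 1/sqrt(n_layers) step size standing in for "diffusion scaling", the projection to the unit sphere standing in for RMS normalization (which differs by a constant factor), and every function and parameter name (`simulate`, `beta`, `min_pairwise_cosine`, and so on) are hypothetical choices made for illustration. The abstract is truncated, so the limiting particle system itself is not reproduced here.

```python
import math
import random


def normalize(v):
    """Project a vector onto the unit sphere (RMS normalization up to a constant)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def attention_average(tokens, beta):
    """Softmax self-attention output per token, with identity query/key (assumed)."""
    out = []
    for x in tokens:
        scores = [beta * dot(x, y) for y in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * y[k] for w, y in zip(weights, tokens)) / total
            for k in range(len(x))
        ])
    return out


def simulate(n_tokens=4, dim=3, n_layers=200, beta=1.0, seed=0):
    """Run a toy deep Transformer with a fresh Gaussian value matrix per layer."""
    rng = random.Random(seed)
    tokens = [
        normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n_tokens)
    ]
    scale = 1.0 / math.sqrt(n_layers)  # assumed form of the diffusion scaling
    for _ in range(n_layers):
        # One random value matrix per layer; the same matrix acts on every token.
        value = [
            [rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
            for _ in range(dim)
        ]
        attn = attention_average(tokens, beta)
        tokens = [
            normalize([
                x[k] + scale * sum(value[k][j] * a[j] for j in range(dim))
                for k in range(dim)
            ])
            for x, a in zip(tokens, attn)
        ]
    return tokens


def min_pairwise_cosine(tokens):
    """Smallest cosine similarity between distinct tokens (1.0 means full collapse)."""
    return min(
        dot(x, y) for i, x in enumerate(tokens) for y in tokens[i + 1:]
    )


if __name__ == "__main__":
    final = simulate()
    print("min pairwise cosine:", min_pairwise_cosine(final))
```

Running the script prints the smallest pairwise cosine similarity after the last layer; a value near 1 would indicate that all tokens have collapsed to a single point. Varying `beta`, `n_layers`, or the seed is one way to explore the toy dynamics, though nothing here should be read as reproducing the paper's results.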

Taxonomy

Topics: Quantum many-body systems · Stochastic Gradient Optimization Techniques · Quantum chaos and dynamical systems