The Mean-Field Dynamics of Transformers
Philippe Rigollet

TL;DR
This paper introduces a mathematical framework that interprets Transformer attention as an interacting particle system. It shows how tokens cluster over time and how normalization changes the speed of that clustering in deep attention models.
Contribution
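It develops a mean-field theory for Transformers that connects attention dynamics to Wasserstein gradient flows, Kuramoto synchronization models, and mean-shift clustering. It also analyzes how normalization schemes alter contraction speeds and identifies a phase transition for long-context attention.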
Findings
Tokens cluster globally in the long run, but only after long metastable states in which they are arranged into multiple clusters
Normalization schemes influence contraction speeds and clustering behavior
Long-context attention exhibits a phase transition
Abstract
We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
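The abstract does not write out the particle system, so the following is only a minimal sketch under stated assumptions: tokens are points on the unit sphere, the query, key, and value matrices are taken to be the identity, and each token moves toward the softmax-weighted average of all tokens, projected onto the tangent space. The names `attention_step`, `simulate`, `beta` (inverse temperature), and `dt` (step size) are hypothetical and chosen for illustration; the paper's exact model and normalization may differ.

```python
import numpy as np


def attention_step(x, beta=1.0, dt=0.1):
    """One explicit Euler step of idealized softmax attention on the sphere.

    x has shape (n, d); each row is a unit-norm token (assumed model).
    """
    # Pairwise inner products scaled by the inverse temperature.
    logits = beta * (x @ x.T)
    # Row-wise softmax, shifted by the row max for numerical stability.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Attention-weighted average of all tokens, one row per token.
    drift = weights @ x
    # Remove the radial component so the drift is tangent to the sphere.
    drift -= np.sum(drift * x, axis=1, keepdims=True) * x
    # Euler step, then renormalize each token back onto the sphere.
    x_new = x + dt * drift
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def min_pairwise_inner_product(x):
    """Smallest inner product between any two tokens (1.0 means one cluster)."""
    return float((x @ x.T).min())


def simulate(n=32, d=3, beta=1.0, dt=0.1, steps=2000, seed=0):
    """Run the dynamics from random tokens and track how tightly they cluster."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    history = [min_pairwise_inner_product(x)]
    for _ in range(steps):
        x = attention_step(x, beta=beta, dt=dt)
        history.append(min_pairwise_inner_product(x))
    return x, history
```

Calling `simulate()` and plotting `history` against the step index is one way to look for the behavior the abstract describes: long plateaus would correspond to metastable multi-cluster arrangements, and a value approaching 1 would indicate that all tokens have merged. Larger values of `beta` sharpen the attention weights, which in this sketch tends to lengthen those plateaus.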
Taxonomy
Topics: Neural dynamics and brain function · Micro and Nano Robotics · Quantum many-body systems
