Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith

TL;DR
Using mean-field theory and Wasserstein metrics, this paper analyzes deep encoder-only transformers at inference time in the low-temperature regime and shows how token distributions concentrate and evolve over time.
Contribution
It describes token evolution in the large-token limit by a mean-field continuity equation and proves that the token distribution rapidly concentrates onto a projection of the initial distribution.
Findings
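The token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times.
The Wasserstein distance between the token distribution and this projected distribution is bounded in terms of the temperature parameter and the inference time, which quantifies the concentration.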
Numerical experiments validate the theoretical concentration and reveal a terminal phase dominated by the value matrix spectrum.
Abstract
Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show how the Wasserstein distance of the two distributions scales in terms of the temperature parameter and inference time…
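The interacting-particle picture in the abstract, with tokens as particles, can be illustrated with a small simulation. The sketch below uses a commonly studied continuous-time self-attention model in which each token moves toward a softmax-weighted average of the value-transformed tokens. The exact dynamics, the forward Euler discretization, and all names (`attention_velocity`, `simulate`, `beta` as an inverse temperature, so that low temperature means large `beta`) are assumptions made for illustration; the paper may use a different normalization or state space.

```python
import numpy as np

def attention_velocity(X, Q, K, V, beta):
    """Velocity of each token under softmax self-attention (assumed model).

    X: (n, d) array of n tokens; Q, K, V: (d, d) matrices;
    beta: inverse temperature (large beta = low temperature).
    """
    # logits[i, j] = beta * <Q x_i, K x_j>
    logits = beta * (X @ Q.T) @ (X @ K.T).T
    # Subtract the row maximum for numerical stability of the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each token's velocity is the attention-weighted average of V x_j.
    return weights @ (X @ V.T)

def simulate(X0, Q, K, V, beta, t_final, dt=1e-2):
    """Forward Euler integration of the token dynamics up to time t_final."""
    X = np.array(X0, dtype=float)
    for _ in range(int(round(t_final / dt))):
        X = X + dt * attention_velocity(X, Q, K, V, beta)
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 2, 64
    X0 = rng.normal(size=(n, d))
    I = np.eye(d)
    # Compare the token spread at high and low temperature.
    for beta in (1.0, 100.0):
        XT = simulate(X0, I, I, I, beta, t_final=1.0)
        print(beta, XT.std(axis=0))
```

With `beta = 0` every token attends uniformly to all others, while for large `beta` each token attends almost exclusively to the token maximizing the query-key inner product, which is the regime the paper's concentration result concerns.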
