Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

Albert Alcalde; Leon Bungert; Konstantin Riedl; Tim Roith

arXiv:2605.10931·math.AP·May 12, 2026

TL;DR

This paper analyzes the behavior of deep encoder-only transformers at inference time in the low-temperature regime, showing how token distributions concentrate and evolve over time using mean-field theory and Wasserstein metrics.

Contribution

It introduces a mean-field framework to describe token evolution in transformers and proves rapid concentration of token distributions onto a projected initial distribution.
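As a rough illustration of the token dynamics behind this framework, the sketch below implements one explicit Euler step of a finite-token softmax attention system. The specific form used here (velocity of token i equal to the softmax-weighted average of V x_j, with scores β⟨Q x_i, K x_j⟩ and no normalization layer) is an assumption borrowed from common particle models of self-attention, not taken from the paper. The names `attention_weights` and `euler_step` and all numerical values are hypothetical.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def attention_weights(tokens, beta, Q, K):
    """Row-stochastic softmax weights with scores beta * <Q x_i, K x_j>.

    beta is the inverse temperature; large beta is the low-temperature regime.
    """
    queries = [matvec(Q, x) for x in tokens]
    keys = [matvec(K, x) for x in tokens]
    weights = []
    for q in queries:
        scores = [beta * dot(q, k) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def euler_step(tokens, beta, Q, K, V, dt):
    """One explicit Euler step of the assumed attention dynamics."""
    weights = attention_weights(tokens, beta, Q, K)
    values = [matvec(V, x) for x in tokens]
    new_tokens = []
    for x, row in zip(tokens, weights):
        velocity = [sum(w * v[k] for w, v in zip(row, values))
                    for k in range(len(x))]
        new_tokens.append([xk + dt * vk for xk, vk in zip(x, velocity)])
    return new_tokens

# Hypothetical example: three tokens in the plane, identity Q, K, V.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
for beta in (1.0, 200.0):
    rows = attention_weights(tokens, beta, identity, identity)
    print(beta, [round(max(row), 3) for row in rows])
```

In this toy setup, raising β makes each row of the attention matrix put almost all of its mass on a single token, which is the low-temperature behavior the summary refers to. The sketch says nothing about the mean-field limit itself.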

Findings

01

Token distribution concentrates onto a projected initial distribution.

02

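The Wasserstein distance between the token distribution and the projected initial distribution is bounded by a term that vanishes as the temperature tends to zero but grows exponentially in time, plus a term that decays exponentially in time, so concentration is rapid and persists for moderate times.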

03

Numerical experiments validate the theoretical concentration and reveal a terminal phase dominated by the value matrix spectrum.

Abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\log(\beta + 1)/\beta \, \exp(C t) + \exp(-c t)$ in terms of the temperature parameter $\beta^{-1} \to 0$ and inference time…
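To get a feel for the trade-off in the rate quoted in the abstract, the following sketch evaluates the expression as printed there, read as (log(β + 1)/β)·exp(Ct) + exp(−ct), and searches a time grid for the value of t at which it is smallest. The constants C and c are not given in the abstract, so the defaults C = c = 1, the grid, and the function names are placeholders chosen purely for illustration.

```python
import math

def concentration_rate(beta, t, C=1.0, c=1.0):
    """Evaluate log(beta + 1) / beta * exp(C t) + exp(-c t).

    C and c are hypothetical constants; the abstract does not specify them.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    growth = math.log(beta + 1.0) / beta * math.exp(C * t)  # small for large beta, grows in t
    decay = math.exp(-c * t)                                # independent of beta, decays in t
    return growth + decay

def best_time_on_grid(beta, t_max=10.0, steps=1000, C=1.0, c=1.0):
    """Return the grid time in [0, t_max] at which the expression is smallest."""
    times = [t_max * k / steps for k in range(steps + 1)]
    return min(times, key=lambda t: concentration_rate(beta, t, C, c))

for beta in (10.0, 1000.0):
    t_star = best_time_on_grid(beta)
    print(beta, t_star, concentration_rate(beta, t_star))
```

Under these placeholder constants, a larger β both lowers the smallest attainable value and pushes the time at which it occurs later, which is consistent with the "metastable for moderate times" description: the first term eventually takes over as t grows.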
