Perceptrons and localization of attention's mean-field landscape
Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet

TL;DR
This paper models the Transformer forward pass as an interacting particle system on the unit sphere, analyzes the effect of the perceptron block, and shows that critical points are generically atomic and localized on subsets of the sphere.
Contribution
Working in the mean-field (infinite context length) setting for Transformers, it studies the effect of the perceptron block and shows that it localizes critical points on subsets of the sphere.
Findings
Critical points are generically atomic and localized on subsets of the sphere.
The system can be viewed as a gradient flow for an explicit energy under certain weight settings.
The infinite context length (mean-field) limit can be made sense of via Wasserstein gradient flows.
Abstract
The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.
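To make the particle picture concrete, here is a minimal NumPy sketch of such a system. It is not the paper's exact model: the softmax attention field with identity query, key, and value matrices, the one-layer ReLU perceptron term, the parameter names (beta, W, A, b), and the explicit Euler step followed by renormalization are all assumptions chosen for illustration.

```python
import numpy as np

def project_tangent(x, v):
    # Remove from each row of v its component along the matching row of x,
    # so the velocity is tangent to the unit sphere at x.
    return v - np.sum(x * v, axis=1, keepdims=True) * x

def attention_field(x, beta):
    # Softmax-weighted average of all particles (assumed: identity Q, K, V).
    logits = beta * (x @ x.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

def perceptron_field(x, W, A, b):
    # Assumed one-hidden-layer ReLU perceptron applied to each particle.
    return np.maximum(x @ A.T + b, 0.0) @ W.T

def step(x, dt, beta, W, A, b):
    # One "layer": explicit Euler step in the tangent direction,
    # then renormalize as an idealized layer normalization.
    v = attention_field(x, beta) + perceptron_field(x, W, A, b)
    x_new = x + dt * project_tangent(x, v)
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)

def simulate(n=32, d=3, steps=200, dt=0.05, beta=1.0, seed=0):
    # n particles (tokens) on the unit sphere in R^d; steps play the role of layers.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = 0.1 * rng.normal(size=(d, d))  # hypothetical perceptron weights
    A = rng.normal(size=(d, d))
    b = np.zeros(d)
    for _ in range(steps):
        x = step(x, dt, beta, W, A, b)
    return x

if __name__ == "__main__":
    final = simulate()
    print(final.shape)
```

Here each row of x is a token embedding and each call to step stands in for one layer. Whether the particles cluster, and where, depends on the assumed weights; the sketch only illustrates the structure of the dynamics, not the paper's results.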
