Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, Marco Romito

TL;DR
This paper establishes a rigorous mathematical framework for the evolution of tokens in deep transformer models, showing convergence to a stochastic particle system and analyzing synchronization effects induced by noise.
Contribution
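It proves pathwise convergence of transformer token dynamics to a continuous-time stochastic interacting particle system, identifies the stochastic PDE governing the tokens' distribution in that limit, and characterizes the conditions under which noise-induced synchronization occurs.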
Findings
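The layerwise evolution of tokens converges pathwise to a continuous-time stochastic interacting particle system; a stochastic PDE describes the tokens' distribution in this limit.
Synchronization by noise occurs in the limiting model when the common noise is sufficiently coercive relative to the deterministic self-attention drift.
Under the same condition, the interaction energy dissipates exponentially on average.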
Abstract
We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying the former condition.
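To make the objects in the abstract concrete, here is a minimal toy sketch in Python (NumPy) of a layerwise token update with a self-attention drift and an MLP term whose weights are shared by all tokens within a layer, which is one way a common noise can arise. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the softmax attention drift, the fresh Gaussian MLP weights per layer, the step scalings `h` and `sqrt(h)`, the `tanh` activation, and the use of mean pairwise squared distance as a stand-in for the interaction energy.

```python
import numpy as np


def softmax_rows(a):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)


def interaction_energy(x):
    """Half the mean pairwise squared distance between tokens (illustrative proxy)."""
    diffs = x[:, None, :] - x[None, :, :]
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=-1))


def simulate_tokens(n_tokens=16, dim=4, width=64, depth=200, horizon=1.0,
                    beta=1.0, noise_scale=1.0, seed=0):
    """Toy residual update: attention drift plus a random MLP shared by all tokens.

    All modelling choices here are assumptions for illustration only.
    Returns the final tokens and the energy recorded before and after each layer.
    """
    rng = np.random.default_rng(seed)
    h = horizon / depth                       # layer step size
    x = rng.standard_normal((n_tokens, dim))  # initial tokens
    energies = [interaction_energy(x)]
    for _ in range(depth):
        # Deterministic self-attention drift: softmax-weighted average minus the token.
        attn = softmax_rows(beta * x @ x.T)
        drift = attn @ x - x
        # Fresh Gaussian MLP weights per layer, the same for every token (common noise).
        u = rng.standard_normal((dim, width)) / np.sqrt(dim)
        v = rng.standard_normal((width, dim)) / np.sqrt(width)
        mlp = np.tanh(x @ u) @ v
        x = x + h * drift + np.sqrt(h) * noise_scale * mlp
        energies.append(interaction_energy(x))
    return x, np.array(energies)
```

Calling `simulate_tokens(noise_scale=0.0)` and `simulate_tokens(noise_scale=2.0)` and comparing the two energy curves is one way to explore how the shared MLP term interacts with the attention drift in this toy. The sketch makes no claim about which regime synchronizes; the paper's coercivity condition is what settles that for the actual model.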
