Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows
Alex Massucco, Leonardo Del Grande, Marcello Carioni, Christoph Brune, Carola-Bibiane Schönlieb

TL;DR
This paper models multi-headed transformer data flow as time-dependent Wasserstein gradient flows, providing theoretical analysis of stability, convergence, and asymptotic behavior, supported by numerical experiments.
Contribution
It introduces a novel mathematical framework linking transformer architectures to gradient flows, enabling rigorous analysis of their dynamics and stability.
Findings
Gradient flows have stationary points at limiting weight distributions.
The models are robust to noisy inputs and initial data perturbations.
Gradient flows converge under Gamma-convergence of the interaction energy.
Abstract
In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. However, from a theoretical point of view, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions. In this paper, we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. The explicit dependence on time allows us to consider different weights for each head and for each layer, without imposing constraints on the initialization method. Moreover, we prove that, under a suitable integrability assumption on the evolution of the weights, each element of the ω-limit set of the gradient flows is a…
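To make the idea of a time-dependent gradient flow for an attention-style interaction energy concrete, here is a minimal finite-particle sketch. It is an illustration under stated assumptions, not the paper's construction: tokens are taken to be points on the unit sphere, each head h carries a symmetric weight matrix A_h(t), the energy is assumed to be E_t(x) = (1/(2n^2)) * sum over heads and pairs (i, j) of exp(<x_i, A_h(t) x_j>), and the flow is discretized by explicit Euler ascent steps with renormalization. The names `energy`, `step`, `simulate`, and `heads_at` are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def energy(points, heads):
    """Assumed interaction energy: sum over heads and token pairs of
    exp(<x_i, A_h x_j>), scaled by 1 / (2 n^2). Each A_h must be symmetric."""
    n = len(points)
    total = 0.0
    for A in heads:
        for x in points:
            for y in points:
                total += math.exp(dot(x, matvec(A, y)))
    return total / (2 * n * n)


def step(points, heads, tau):
    """One explicit Euler step of the (assumed) ascent flow on the sphere."""
    n = len(points)
    new_points = []
    for x in points:
        # Velocity = n * dE/dx_i = (1/n) * sum_h sum_j exp(<x_i, A_h x_j>) A_h x_j
        v = [0.0] * len(x)
        for A in heads:
            for y in points:
                Ay = matvec(A, y)
                w = math.exp(dot(x, Ay))
                v = [vi + w * ai / n for vi, ai in zip(v, Ay)]
        # Project onto the tangent space at x, move, then renormalize.
        radial = dot(v, x)
        tangent = [vi - radial * xi for vi, xi in zip(v, x)]
        moved = [xi + tau * ti for xi, ti in zip(x, tangent)]
        norm = math.sqrt(dot(moved, moved))
        new_points.append([m / norm for m in moved])
    return new_points


def simulate(points, heads_at, tau, n_steps):
    """heads_at(t) returns the list of head matrices at time t, so the
    weights may differ from layer to layer (one Euler step per layer)."""
    for k in range(n_steps):
        points = step(points, heads_at(k * tau), tau)
    return points
```

Passing a `heads_at` that varies with `t` mimics layer-dependent weights; with constant weights and a small step size, the assumed energy should not decrease along the iteration, which is the discrete counterpart of the gradient-flow structure. The sign convention (ascent rather than descent) and the sphere constraint are choices made for this sketch only.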
