Attention's forward pass and Frank-Wolfe
Albert Alcalde, Borjan Geshkovski, Domènec Ruiz-Balet

TL;DR
This paper analyzes the behavior of self-attention in the zero-temperature limit, revealing its connection to Frank-Wolfe optimization, and studies the dynamics and metastability of the process at finite temperature.
Contribution
It establishes a novel link between self-attention's hardmax limit and Frank-Wolfe steps, and characterizes the long-term dynamics and metastability in finite-temperature regimes.
Findings
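When the key-query matrix is positive semidefinite, the hardmax limit induces a Voronoi diagram structure in the token dynamics.
When the key-query matrix is negative semidefinite, tokens contract to a single point at the origin; when it is positive semidefinite, each token moves along a straight line toward the vertex of its Voronoi cell.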
Finite-temperature dynamics exhibit metastability with exponential time scales.
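The relation between the finite-temperature and hardmax regimes can be illustrated with a small sketch. It assumes the usual softmax parametrization with an inverse temperature, here given the hypothetical name `beta` (temperature is `1 / beta`); the scores are made-up numbers, not taken from the paper.

```python
import math

def softmax_weights(scores, beta):
    """Softmax attention weights at inverse temperature beta (assumed name)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative attention scores for one token (hypothetical values).
scores = [1.0, 0.5, -0.2]

# As beta grows, the weights concentrate on the highest score,
# which is the hardmax (zero-temperature) behavior.
for beta in (1.0, 10.0, 100.0):
    print(beta, softmax_weights(scores, beta))
```

At `beta = 0` the weights are uniform, and for large `beta` nearly all of the mass sits on the largest score, so the softmax average approaches a selection of a single token.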
Abstract
We study the hardmax limit of self-attention dynamics for token embeddings obtained in the zero-temperature regime, and relate it to the finite-temperature setting. In this limit, the update rule can be viewed as a Frank-Wolfe step for a quadratic objective over the convex hull of the current token embeddings. When the key-query matrix is negative semidefinite, the method linearly contracts all tokens to a single cluster at the origin. When it is positive semidefinite, extending the hardmax rule to the entire convex hull induces a Voronoi diagram: vertices are stationary, interior points remain in their initial cells, and each token moves along a straight line toward its cell's vertex, yielding (super-)exponential convergence. As a byproduct, we also establish well-posedness of the associated ODE limit in this regime. Returning to the finite-temperature regime, we model…
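The Frank-Wolfe reading of the hardmax update can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact update rule: the key-query matrix `B` is taken to be symmetric, the value matrix is taken to be the identity, the objective is assumed to be `0.5 * x^T B x`, and the function name `hardmax_step` and the fixed `step` size are hypothetical. Because a linear function over a convex hull attains its maximum at a vertex, picking the token with the largest score against the gradient `B x` plays the role of the Frank-Wolfe linear oracle.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(B, x):
    return [dot(row, x) for row in B]

def hardmax_step(tokens, B, step):
    """One synchronous hardmax update, read as a Frank-Wolfe step.

    Assumptions (not from the paper): symmetric B, identity value matrix,
    objective 0.5 * x^T B x, constant step size.
    """
    new_tokens = []
    for x in tokens:
        grad = matvec(B, x)  # gradient of 0.5 * x^T B x when B is symmetric
        # Linear oracle over the convex hull of the current tokens:
        # the maximizer of <grad, y> can be taken among the tokens themselves.
        target = max(tokens, key=lambda y: dot(grad, y))
        # Move x a fraction `step` of the way toward the selected token.
        new_tokens.append([xi + step * (ti - xi) for xi, ti in zip(x, target)])
    return new_tokens

# Positive semidefinite example: B is the identity (illustrative choice).
B = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [-2.0, 0.0], [1.0, 0.5]]
after = hardmax_step(tokens, B, step=0.5)
```

In this toy configuration the two extreme tokens select themselves and stay put, while the third token moves along the segment toward `[2.0, 0.0]`, in line with the stationary-vertex and straight-line behavior described in the abstract for the positive semidefinite case.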
Taxonomy
Topics: Markov Chains and Monte Carlo Methods · Random Matrices and Applications · Stochastic Gradient Optimization Techniques
