Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
Gautam Goel, Mahdi Soltanolkotabi, Peter Bartlett

TL;DR
This paper analyzes the training dynamics of a softmax self-attention layer trained to perform linear regression, and shows that a preconditioned variant of gradient descent converges to the globally optimal parameters at a geometric rate.
Contribution
The paper introduces a "structure-aware" variant of gradient descent for training a self-attention layer. It combines a preconditioner, a regularizer, and a data-dependent spectral initialization to obtain fast global convergence.
Findings
A structure-aware variant of gradient descent converges to the globally optimal self-attention parameters at a geometric rate.
The preconditioner and regularizer help the method avoid spurious stationary points.
Spectral initialization places parameters near global minima with high probability.
Abstract
We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the…
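To make the ingredients named in the abstract more concrete, here is a minimal sketch of preconditioned gradient descent with a spectral initialization on a generic low-rank matrix factorization problem. It is not the paper's algorithm and does not include the self-attention objective or the regularizer. The objective, the right-inverse preconditioner (in the style of scaled gradient descent), the noisy proxy used for initialization, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def spectral_init(M_proxy, r):
    """Build balanced rank-r factors from the top-r SVD of a data-dependent proxy."""
    U, s, Vt = np.linalg.svd(M_proxy, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

def factorization_loss(L, R, M):
    """Nonconvex objective 0.5 * ||L R^T - M||_F^2."""
    return 0.5 * np.linalg.norm(L @ R.T - M, "fro") ** 2

def run_gd(M, L, R, step, iters, precondition):
    """Gradient descent on (L, R); optionally right-precondition each gradient."""
    losses = []
    for _ in range(iters):
        resid = L @ R.T - M
        grad_L = resid @ R
        grad_R = resid.T @ L
        if precondition:
            # Assumed preconditioner: rescale by the inverse Gram matrix of the
            # other factor, which removes the dependence on the conditioning of M.
            grad_L = grad_L @ np.linalg.inv(R.T @ R)
            grad_R = grad_R @ np.linalg.inv(L.T @ L)
        L, R = L - step * grad_L, R - step * grad_R
        losses.append(factorization_loss(L, R, M))
    return L, R, losses

# Hypothetical test problem: an ill-conditioned rank-3 target matrix.
rng = np.random.default_rng(0)
n, r = 30, 3
sigma = np.array([100.0, 10.0, 1.0])
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
M = (U * sigma) @ V.T

# A slightly perturbed copy of M stands in for a finite-data estimate.
M_proxy = M + 1e-3 * rng.standard_normal((n, n))
L0, R0 = spectral_init(M_proxy, r)

# Plain gradient descent needs a step size on the order of 1 / sigma_max.
_, _, plain_losses = run_gd(M, L0, R0, step=0.5 / sigma[0], iters=60, precondition=False)
# The preconditioned variant can use a step size independent of the spectrum.
_, _, scaled_losses = run_gd(M, L0, R0, step=0.5, iters=60, precondition=True)

print("plain GD final loss:         ", plain_losses[-1])
print("preconditioned GD final loss:", scaled_losses[-1])
```

In this toy setting, the spectral initialization starts both runs close to a global minimizer. The preconditioned iteration then shrinks the loss at a rate that does not depend on the condition number of the target, whereas plain gradient descent is slowed by the smallest singular value.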
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks
