Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

Gautam Goel; Mahdi Soltanolkotabi; Peter Bartlett

arXiv:2603.01514·cs.LG·March 3, 2026

Open Access

TL;DR

This paper analyzes the training dynamics of a softmax self-attention layer trained to perform linear regression and shows that a preconditioned, structure-aware variant of gradient descent converges to the globally optimal parameters at a geometric rate.

Contribution

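The paper introduces a structure-aware variant of gradient descent for training a softmax self-attention layer. The algorithm combines a preconditioner and a regularizer, which help avoid spurious stationary points, with a data-dependent spectral initialization, and it converges to the global optimum at a geometric rate.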

Findings

01. A structure-aware variant of gradient descent converges to the globally optimal self-attention parameters at a geometric rate.

02. The preconditioner and regularizer help the method avoid spurious stationary points.

03. Spectral initialization places parameters near global minima with high probability.

Abstract

We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the…
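The abstract links the attention problem to nonconvex matrix factorization and names two ingredients, a preconditioner and a spectral initialization. The sketch below illustrates those two ideas on a generic low-rank factorization objective. It is not the paper's algorithm: the objective, the Gram-matrix preconditioner (in the style of scaled gradient descent), the step size, the noise level, and all function names are assumptions chosen for illustration, and the regularizer mentioned in the abstract is omitted.

```python
import numpy as np


def spectral_init(m_est, rank):
    """Balanced factors from the top-`rank` SVD of an estimate of the target."""
    u, s, vt = np.linalg.svd(m_est, full_matrices=False)
    root = np.sqrt(s[:rank])
    return u[:, :rank] * root, vt[:rank].T * root


def preconditioned_gd(target, left, right, step=0.5, iters=100):
    """Minimize 0.5 * ||left @ right.T - target||_F^2 with preconditioned steps."""
    target_norm = np.linalg.norm(target)
    errors = []
    for _ in range(iters):
        residual = left @ right.T - target
        errors.append(np.linalg.norm(residual) / target_norm)
        grad_left = residual @ right
        grad_right = residual.T @ left
        # Assumed preconditioner: right-multiply each gradient by the inverse
        # Gram matrix of the other factor (both Gram matrices are symmetric).
        new_left = left - step * np.linalg.solve(right.T @ right, grad_left.T).T
        new_right = right - step * np.linalg.solve(left.T @ left, grad_right.T).T
        left, right = new_left, new_right
    errors.append(np.linalg.norm(left @ right.T - target) / target_norm)
    return left, right, errors


def run_demo(seed=0):
    """Ill-conditioned rank-3 target; the noisy copy stands in for finite data."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols, rank = 30, 20, 3
    u, _ = np.linalg.qr(rng.standard_normal((n_rows, rank)))
    v, _ = np.linalg.qr(rng.standard_normal((n_cols, rank)))
    target = (u * np.array([100.0, 10.0, 1.0])) @ v.T
    noisy = target + 0.01 * rng.standard_normal((n_rows, n_cols))
    left, right = spectral_init(noisy, rank)
    _, _, errors = preconditioned_gd(target, left, right)
    return errors


if __name__ == "__main__":
    errs = run_demo()
    print(f"relative error: start {errs[0]:.2e}, end {errs[-1]:.2e}")
```

In this toy setting the spectral initialization starts the iterates close to the target, and the Gram-matrix scaling keeps the per-step progress from depending on the spread of the target's singular values. Whether the paper's preconditioner takes this form cannot be determined from the truncated abstract.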

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks