Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality
Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

TL;DR
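This paper analyzes the gradient-flow training dynamics of multi-head softmax attention for in-context learning of multi-task linear regression. It identifies three training phases (warm-up, emergence, convergence) in which heads allocate themselves to tasks, proves global convergence, and establishes optimality of the learned model, using a novel spectral-domain analysis.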
Contribution
It provides the first convergence analysis for multi-head softmax attention, demonstrating the emergence of task specialization across heads and the optimality of gradient flow.
Findings
Gradient flow converges globally under suitable initialization.
Attention heads specialize on individual tasks during training.
The learned model is near-optimal compared to the best possible attention model.
Abstract
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases -- a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the…
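To make the setting in the abstract concrete, here is a minimal NumPy sketch of a multi-head softmax attention predictor applied to one in-context prompt for multi-task linear regression. It is an illustration under stated assumptions, not the paper's exact parameterization: the names `make_prompt`, `multi_head_attention`, `W`, `U`, and `G`, the Gaussian data model, and the per-head score and output matrices are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def make_prompt(rng, L, d, d_y, noise=0.1):
    """Sample one prompt: L context pairs plus a query (assumed Gaussian data)."""
    G = rng.normal(size=(d, d_y))                   # hypothetical task matrix, one column per task
    X = rng.normal(size=(L, d))                     # context covariates
    Y = X @ G + noise * rng.normal(size=(L, d_y))   # noisy multi-task responses
    x_q = rng.normal(size=d)                        # query covariate
    y_q = G.T @ x_q                                 # noiseless target for the query
    return X, Y, x_q, y_q

def multi_head_attention(X, Y, x_q, W, U):
    """Sum of H softmax attention heads.

    W: (H, d, d) per-head score matrices (assumed parameterization).
    U: (H, d_y, d_y) per-head output matrices (assumed parameterization).
    """
    pred = np.zeros(Y.shape[1])
    for W_h, U_h in zip(W, U):
        p = softmax(X @ W_h @ x_q)   # attention weights over the L context examples
        pred += U_h @ (Y.T @ p)      # head output: transformed weighted average of responses
    return pred

rng = np.random.default_rng(0)
L, d, d_y, H = 32, 4, 2, 2
X, Y, x_q, y_q = make_prompt(rng, L, d, d_y)
W = 0.1 * rng.normal(size=(H, d, d))       # small random initialization (assumption)
U = 0.1 * rng.normal(size=(H, d_y, d_y))
y_hat = multi_head_attention(X, Y, x_q, W, U)
loss = 0.5 * np.sum((y_hat - y_q) ** 2)    # squared-error in-context loss on the query
```

In this sketch, "task allocation" would correspond to each head's output matrix `U[h]` concentrating on a single response coordinate after training; the code only evaluates the forward pass and loss at initialization and does not simulate the gradient flow itself.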
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Softmax
