Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang

arXiv:2402.19442 · cs.LG · June 11, 2024 · 3 cites

Open Access

TL;DR

This paper analyzes the gradient-flow training dynamics of multi-head softmax attention for in-context learning of multi-task linear regression. It establishes global convergence, shows that task allocation emerges across heads, and proves optimality of gradient flow, supported by a novel spectral-domain analysis.

Contribution

It provides the first convergence analysis for multi-head softmax attention trained by gradient flow, showing that task specialization emerges across heads and that the model gradient flow learns is near-optimal.

Findings

01. Gradient flow converges globally under suitable initialization.

02. Attention heads specialize on individual tasks during training.

03. The learned model is near-optimal compared to the best possible attention model.

Abstract

We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases -- a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the…
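To make the setting in the abstract concrete, here is a minimal sketch of a multi-head softmax attention predictor applied to an in-context multi-task linear regression prompt. The abstract does not give the model's parametrization, so everything below is an assumption made for illustration: each head h has a matrix `W_heads[h]` that scores context covariates against the query and a matrix `U_heads[h]` that maps the attention-weighted label average to the output, the covariates are Gaussian, and the labels are noiseless. The names `sample_prompt` and `multi_head_predict` are hypothetical, and no training is performed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]


def sample_prompt(rng, d, n_tasks, L):
    """Draw one prompt: L labelled examples plus a query (assumed data model).

    Row i of G is the regression vector of task i, so y[i] = <G[i], x>.
    """
    G = [[rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(n_tasks)]
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
    ys = [matvec(G, x) for x in xs]
    x_q = [rng.gauss(0, 1) for _ in range(d)]
    return xs, ys, x_q, matvec(G, x_q)


def multi_head_predict(xs, ys, x_q, W_heads, U_heads):
    """Sum over heads of U_h applied to the softmax-weighted average of labels."""
    n_tasks = len(ys[0])
    pred = [0.0] * n_tasks
    for W, U in zip(W_heads, U_heads):
        Wq = matvec(W, x_q)
        weights = softmax([dot(x, Wq) for x in xs])  # attention over context
        avg = [sum(w * y[i] for w, y in zip(weights, ys)) for i in range(n_tasks)]
        out = matvec(U, avg)
        pred = [p + o for p, o in zip(pred, out)]
    return pred


rng = random.Random(0)
d, n_tasks, L = 4, 2, 16
xs, ys, x_q, y_q = sample_prompt(rng, d, n_tasks, L)

# Hand-built "task allocation": head h writes only to output coordinate h.
U_heads = [
    [[1.0 if i == j == h else 0.0 for j in range(n_tasks)] for i in range(n_tasks)]
    for h in range(n_tasks)
]
# Zero score matrices give uniform attention, used here only as a sanity check.
zero = [[0.0] * d for _ in range(d)]
W_heads = [zero for _ in range(n_tasks)]

pred = multi_head_predict(xs, ys, x_q, W_heads, U_heads)
```

With zero score matrices the attention is uniform, so each head simply returns the mean context label of its own task. The allocation pattern is hard-coded here to show what "each head focuses on a single task" can look like in this parametrization. In the paper it is a property that emerges from gradient flow.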

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Softmax