Specialization of softmax attention heads: insights from the high-dimensional single-location model
M. Sagitova, O. Duranthon, L. Zdeborová

TL;DR
This paper studies how the heads of multi-head softmax attention in transformers specialize during training, analyzes the training dynamics under SGD, shows that softmax-1 reduces noise from irrelevant heads, and introduces Bayes-softmax attention, which achieves optimal prediction performance.
Contribution
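It introduces a theoretical model of head specialization based on the multi-index and single-location regression frameworks, analyzes the unspecialized and multi-stage specialization phases of training, and proposes Bayes-softmax attention, which achieves optimal prediction performance.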
Findings
Heads specialize sequentially during training.
Softmax-1 reduces noise from irrelevant heads.
Bayes-softmax achieves optimal prediction performance.
Abstract
Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks. In the first part, we analyze the training dynamics of multi-head softmax attention under SGD, revealing an initial unspecialized phase followed by a multi-stage specialization phase in which different heads sequentially align with latent signal directions. In the second part, we study the impact of attention activation functions on performance. We show that softmax-1 significantly reduces noise from irrelevant heads. Finally, we introduce the Bayes-softmax attention, which achieves optimal prediction performance in…
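The effect of softmax-1 on irrelevant heads can be illustrated with a small sketch. It assumes the commonly used definition of softmax-1, in which a constant 1 is added to the softmax denominator; the abstract does not spell out the formula, so this definition is an assumption, and the function names and example scores are hypothetical. With standard softmax the weights always sum to 1, so a head that finds nothing relevant still outputs a full-weight average of its inputs. With softmax-1 the weights can all shrink toward 0, so such a head contributes almost nothing.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_1(scores):
    """Softmax-1 (assumed definition): exp(s_i) / (1 + sum_j exp(s_j)).

    The extra 1 acts like an additional token with score 0 whose weight
    is discarded, so the returned weights sum to less than 1.
    """
    m = max(max(scores), 0.0)  # include the implicit score 0 in the shift
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]


# Hypothetical attention scores of one head over three tokens.
irrelevant = [-10.0, -10.0, -10.0]  # the head matches no token
relevant = [-10.0, 8.0, -10.0]      # the head matches the second token

print("softmax   total weight, irrelevant head:", sum(softmax(irrelevant)))
print("softmax-1 total weight, irrelevant head:", sum(softmax_1(irrelevant)))
print("softmax-1 total weight, relevant head:  ", sum(softmax_1(relevant)))
```

In this toy case the irrelevant head keeps total weight 1 under softmax but nearly 0 under softmax-1, while the relevant head is almost unaffected.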
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection
