Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
Ayan Pendharkar

TL;DR
This paper develops a theoretical framework for the dynamics of multi-head self-attention in transformers, giving conditions for clustering, stability, and entropy increase.
Contribution
The paper introduces an energy functional for multi-head attention, identifies radial shadow terms as the key obstruction to per-head monotonicity, and derives conditions for clustering and stability in transformer models.
Findings
The multi-head attention energy functional is non-decreasing along both flat and spherical dynamics under suitable conditions on the score matrices.
Heterogeneous heads exhibit super-additive clustering rates.
Attention entropy increases monotonically toward equilibrium.
Abstract
Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for single-head attention, the multi-head setting remains less understood due to geometric interference between heads, which invalidates standard monotonicity arguments. In this work, we develop a theoretical framework for multi-head self-attention dynamics and resolve several open questions. We show that, under suitable conditions on the score matrices, a natural multi-head energy functional is non-decreasing along both flat and spherical dynamics. We identify the key obstruction to per-head monotonicity as radial shadow terms, which are projections of each head's output onto token directions, persisting even under orthogonality assumptions. We introduce a…
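The gradient-flow picture in the abstract can be illustrated numerically. The sketch below is not the paper's multi-head construction: it assumes the simpler single-head setting from prior work, with identity query, key, and value matrices, an inverse temperature `beta`, and an explicit Euler step followed by renormalization onto the sphere. All function names, the energy normalization, and the parameter values are illustrative assumptions. It tracks a softmax interaction energy and the mean entropy of the attention rows as the tokens cluster.

```python
import math
import random


def normalize(v):
    """Project a nonzero vector back onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_weights(tokens, beta):
    """Row-stochastic matrix with A[i][j] proportional to exp(beta * <x_i, x_j>)."""
    rows = []
    for xi in tokens:
        scores = [math.exp(beta * dot(xi, xj)) for xj in tokens]
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows


def energy(tokens, beta):
    """Assumed single-head energy: sum_ij exp(beta * <x_i, x_j>) / (2 * beta * n^2)."""
    n = len(tokens)
    total = sum(math.exp(beta * dot(xi, xj)) for xi in tokens for xj in tokens)
    return total / (2.0 * beta * n * n)


def mean_entropy(tokens, beta):
    """Average Shannon entropy of the attention rows (at most log n)."""
    rows = attention_weights(tokens, beta)
    return sum(-sum(a * math.log(a) for a in row) for row in rows) / len(rows)


def step(tokens, beta, dt):
    """One Euler step of x_i' = P_{x_i}(sum_j A_ij x_j), then renormalize."""
    rows = attention_weights(tokens, beta)
    dim = len(tokens[0])
    updated = []
    for xi, row in zip(tokens, rows):
        head_out = [sum(a * xj[k] for a, xj in zip(row, tokens)) for k in range(dim)]
        # Component of the head output along the token direction; removing it
        # leaves the tangential part that actually moves the token on the sphere.
        radial = dot(head_out, xi)
        tangent = [head_out[k] - radial * xi[k] for k in range(dim)]
        updated.append(normalize([xi[k] + dt * tangent[k] for k in range(dim)]))
    return updated


def simulate(n=8, dim=3, beta=1.0, dt=0.05, steps=400, seed=0):
    """Run the flow from random tokens; return final tokens and (energy, entropy) history."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[0] = abs(v[0]) + 1e-3  # assumption: start inside an open hemisphere
        tokens.append(normalize(v))
    history = [(energy(tokens, beta), mean_entropy(tokens, beta))]
    for _ in range(steps):
        tokens = step(tokens, beta, dt)
        history.append((energy(tokens, beta), mean_entropy(tokens, beta)))
    return tokens, history


if __name__ == "__main__":
    final_tokens, hist = simulate()
    print("energy  start/end:", hist[0][0], hist[-1][0])
    print("entropy start/end:", hist[0][1], hist[-1][1])
```

The hemisphere initialization is a convenience chosen so that a single cluster forms quickly. In this simplified setting both tracked quantities should end higher than they started, with the entropy approaching log n as the attention rows become uniform. The multi-head case discussed in the abstract would require separate score matrices per head and is not covered by this sketch.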
