The emergence of clusters in self-attention dynamics

Borjan Geshkovski; Cyril Letrouit; Yury Polyanskiy; Philippe Rigollet

arXiv:2305.05465 · cs.LG · February 14, 2024 · 5 cites

TL;DR

This paper models Transformer self-attention with time-independent weights as an interacting particle system. It shows that token representations cluster toward limiting objects whose locations are determined by the initial tokens, and that the type of limiting object depends on the spectrum of the value matrix, in line with empirical observations.

Contribution

The paper analyzes self-attention as an interacting particle system, using techniques from dynamical systems and partial differential equations. The analysis reveals clustering of tokens and shows that the spectrum of the value matrix determines the type of limiting object that emerges.

Findings

01

Tokens cluster toward limiting objects whose locations are determined by the initial tokens.

02

Self-attention matrices converge to low-rank Boolean matrices in 1D.

03

Taken together, the results mathematically confirm the empirical observation that leaders appear in a sequence of tokens processed by Transformers.

Abstract

Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
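The one-dimensional statement can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes the simplified scalar dynamics dx_i/dt = v * sum_j P_ij x_j with P the row-wise softmax of (q x_i)(k x_j), takes q = k = v = 1 as hypothetical parameter choices, and integrates with forward Euler. The paper's exact formulation (for instance any rescaling of the tokens) may differ, and the function names are made up for this example.

```python
import numpy as np

def attention_matrix(x, q=1.0, k=1.0):
    """Row-stochastic self-attention matrix for scalar tokens x of shape (n,)."""
    # Logit for the pair (i, j) is (q * x_i) * (k * x_j).
    logits = (q * x)[:, None] * (k * x)[None, :]
    # Subtract each row's maximum so the exponentials cannot overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def simulate(x0, v=1.0, t_end=6.0, dt=1e-3):
    """Forward-Euler integration of dx_i/dt = v * sum_j P_ij(x) * x_j.

    q, k, v, t_end and dt are illustrative assumptions, not values from the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(int(round(t_end / dt))):
        x = x + dt * v * (attention_matrix(x) @ x)
    return x, attention_matrix(x)

# Four scalar tokens with arbitrary (hypothetical) initial values.
x_final, P = simulate([-1.0, -0.3, 0.4, 1.2])
print(np.round(P, 3))
```

In this toy run each row of the printed matrix should place almost all of its weight on a single column, so the rounded matrix has 0/1 entries and low rank. The tokens owning those columns play the role of the leaders mentioned in the abstract.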

Code & Models

Repositories

borjang/2023-transformers (Official)

Videos

The emergence of clusters in self-attention dynamics · YouTube

The emergence of clusters in self-attention dynamics · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Neural dynamics and brain function · Statistical Mechanics and Entropy