Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian, Yiping Wang, Beidi Chen, Simon Du

TL;DR
This paper gives a mathematically rigorous analysis of the SGD training dynamics of a 1-layer Transformer on next-token prediction, showing how self-attention gradually learns to discriminate among input tokens and what inductive bias this implies.
Contribution
The paper provides a mathematically rigorous analysis of the training dynamics of a 1-layer Transformer (one self-attention layer plus one decoder layer), characterizing the discriminative scanning behavior of self-attention and a phase transition in the attention dynamics.
Findings
Self-attention acts as a discriminative scanning algorithm.
Over the course of training, attention weights on common tokens decrease.
Training dynamics exhibit a phase transition that is influenced by the learning rates.
Abstract
The Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding of how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, for a 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions of (a) no positional encoding, (b) long input sequence, and (c) the decoder layer learning faster than the self-attention layer, we prove that self-attention acts as…
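The setting described in the abstract can be made concrete with a small forward pass. The following is a minimal sketch, not the paper's exact parameterization: the names `E`, `W_qk`, `W_dec`, the use of the last token as the query, and the scaling by the square root of the embedding dimension are all illustrative assumptions. It shows one self-attention layer followed by one decoder layer, with no positional encoding, producing a next-token distribution.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(tokens, E, W_qk, W_dec):
    """One self-attention layer plus one decoder layer (illustrative only).

    tokens: 1-D int array; the last token is treated as the query
            (an assumption of this sketch).
    E:      (vocab, d) token embeddings; no positional encoding is added.
    W_qk:   (d, d) combined query-key matrix (hypothetical name).
    W_dec:  (vocab, d) decoder matrix (hypothetical name).
    """
    X = E[tokens]                      # (T, d) embeddings
    q = X[-1]                          # query token embedding
    ctx = X[:-1]                       # contextual tokens
    scores = ctx @ W_qk @ q / np.sqrt(E.shape[1])
    attn = softmax(scores)             # attention over contextual tokens
    h = attn @ ctx                     # attention-weighted combination
    logits = W_dec @ h                 # decoder layer
    return attn, softmax(logits)       # next-token distribution

rng = np.random.default_rng(0)
vocab, d = 8, 16
E = rng.normal(size=(vocab, d))
W_qk = 0.01 * rng.normal(size=(d, d))
W_dec = 0.01 * rng.normal(size=(vocab, d))

tokens = np.array([1, 3, 3, 5, 2])     # last token (2) is the query
attn, probs = forward(tokens, E, W_qk, W_dec)

# Without positional encoding, reordering the contextual tokens
# leaves the prediction unchanged.
shuffled = np.array([5, 3, 1, 3, 2])
_, probs_shuffled = forward(shuffled, E, W_qk, W_dec)
print(np.allclose(probs, probs_shuffled))
```

Because no positional information enters the computation, the output depends only on which contextual tokens appear and how often, which is what assumption (a) implies. In a training loop, assumption (c) would correspond to updating `W_dec` with a larger step size than `W_qk`; that loop is omitted here.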
Taxonomy
Topics: Neural Networks and Applications · Machine Learning in Materials Science · Model Reduction and Neural Networks
Methods: Stochastic Gradient Descent
