Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer

Yuandong Tian; Yiping Wang; Beidi Chen; Simon Du

arXiv:2305.16380 · cs.CL · October 31, 2023 · 6 cites

TL;DR

This paper rigorously analyzes the training dynamics of a simple 1-layer Transformer, revealing how self-attention gradually discriminates among tokens during training, with implications for understanding model interpretability and inductive bias.

Contribution

It provides a mathematically rigorous analysis of the training dynamics of a 1-layer Transformer, uncovering the discriminative scanning behavior of self-attention and phase transition phenomena.

Findings

01. Self-attention acts as a discriminative scanning algorithm.

02. Attention weights decrease for common tokens over training.

03. Training dynamics exhibit a phase transition influenced by learning rates.

Abstract

The Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding of how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, for a 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions of (a) no positional encoding, (b) long input sequences, and (c) a decoder layer that learns faster than the self-attention layer, we prove that self-attention acts as…
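To make the setting in the abstract concrete, the following is a minimal sketch of a forward pass through one self-attention layer followed by one decoder layer, with no positional encoding, used for next token prediction. It is an illustration under stated assumptions, not the paper's exact parameterization: the names `forward`, `embed`, `w_qk`, and `w_dec` are hypothetical, the last token is assumed to serve as the query, and any normalization the paper may use is omitted.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(tokens, embed, w_qk, w_dec):
    """One self-attention layer plus one decoder layer (illustrative only).

    tokens: list of token ids; the last one is assumed to be the query.
    embed:  vocab x d embedding table (hypothetical name).
    w_qk:   d x d combined query-key matrix (hypothetical name).
    w_dec:  vocab x d decoder matrix (hypothetical name).
    """
    query = embed[tokens[-1]]
    context = tokens[:-1]
    # Attention score of each contextual token against the query.
    # No positional encoding: the score depends only on token identity.
    wq = [dot(row, query) for row in w_qk]
    attn = softmax([dot(embed[t], wq) for t in context])
    # Attention-weighted combination of the contextual token embeddings.
    d = len(query)
    mixed = [sum(a * embed[t][j] for a, t in zip(attn, context)) for j in range(d)]
    # Decoder layer: map the combined representation to next-token probabilities.
    probs = softmax([dot(row, mixed) for row in w_dec])
    return attn, probs

random.seed(0)
vocab, d = 5, 4
embed = [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]
w_qk = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
w_dec = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(vocab)]

attn, probs = forward([0, 3, 1, 3, 2], embed, w_qk, w_dec)
# Same contextual tokens in a different order, same query token.
attn_perm, probs_perm = forward([3, 1, 0, 3, 2], embed, w_qk, w_dec)
```

Because the sketch has no positional encoding, reordering the contextual tokens leaves the predicted distribution unchanged up to floating-point rounding, and repeated occurrences of the same token receive equal attention. Assumption (c) would correspond to updating `w_dec` with a larger step size than `w_qk` during training, which this forward-only sketch does not implement.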

Videos

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Machine Learning in Materials Science · Model Reduction and Neural Networks

Methods: Stochastic Gradient Descent