Mechanics of Next Token Prediction with Self-Attention
Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

TL;DR
This paper analyzes how self-attention in transformer models learns to predict the next token by retrieving high-priority tokens and then combining them, providing a theoretical understanding of the underlying mechanics.
Contribution
It offers a rigorous theoretical framework showing that gradient descent trains self-attention to perform token retrieval and composition, formalized through a directed graph over tokens and its strongly connected components (SCCs).
Findings
Self-attention learns to retrieve high-priority tokens associated with the last input.
It creates a convex combination of retrieved tokens for next-token sampling.
Gradient descent discovers strongly-connected components in token graphs.
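To make the third finding more concrete, the sketch below computes the strongly connected components of a small directed graph using Kosaraju's algorithm. The toy graph, the token names, and the reading of an edge as a priority relation are hypothetical illustrations rather than the paper's construction, which derives its token graph from the training data.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm. `edges` is a list of (u, v) directed edges."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS finishing time.
    visited, order = set(), []
    def dfs_order(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                dfs_order(nxt)
        order.append(node)
    for node in sorted(nodes):
        if node not in visited:
            dfs_order(node)

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    assigned, components = set(), []
    def dfs_collect(node, comp):
        assigned.add(node)
        comp.add(node)
        for nxt in reverse[node]:
            if nxt not in assigned:
                dfs_collect(nxt, comp)
    for node in reversed(order):
        if node not in assigned:
            comp = set()
            dfs_collect(node, comp)
            components.append(frozenset(comp))
    return components

# Hypothetical toy graph: an edge (u, v) stands for a priority relation
# between tokens u and v (assumed convention, not from the paper).
toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]
sccs = strongly_connected_components(toy_edges)
```

In this toy graph, tokens "a" and "b" reach each other and so share a component, as do "c" and "d"; the single edge from "b" to "c" links the two components in one direction only.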
Abstract
Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: Given input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. It then…
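The select-then-combine behaviour can be illustrated with plain softmax attention, where the last token acts as the query. The sketch below is a minimal illustration under stated assumptions: the scores, the embeddings, and the `scale` parameter (used here only to mimic increasingly sharp attention) are made up and are not taken from the paper.

```python
import math

def attention_weights(scores, scale=1.0):
    """Softmax over scaled scores: nonnegative weights that sum to 1."""
    scaled = [scale * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def compose(weights, embeddings):
    """Convex combination of token embeddings under the given weights."""
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]

# Hypothetical scores of four context tokens against the last token (the query).
scores = [2.0, 2.0, 0.5, -1.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

soft = attention_weights(scores, scale=1.0)   # every token gets some weight
hard = attention_weights(scores, scale=50.0)  # weight concentrates on the top scores
output = compose(hard, embeddings)            # mix of the two top-scoring tokens
```

With a large `scale`, nearly all of the weight falls on the two highest-scoring tokens, so the output is close to an even mix of their embeddings. This mirrors the description above: a sharp selection of a token subset, followed by a convex combination over that subset.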
