Mechanics of Next Token Prediction with Self-Attention

Yingcong Li; Yixiao Huang; M. Emrullah Ildiz; Ankit Singh Rawat; Samet Oymak

arXiv:2403.08081 · cs.LG · March 14, 2024 · 1 citation

Open Access

TL;DR

This paper analyzes what a single self-attention layer learns from next-token prediction: it first retrieves the high-priority input tokens associated with the last input token, then combines them to produce the next token.

Contribution

It offers a theoretical framework showing that gradient descent trains self-attention to perform token retrieval and composition, formalized through token graphs and their strongly connected components (SCCs).
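
To make the SCC terminology concrete, here is a minimal Python sketch that finds the strongly connected components of a small directed graph using Kosaraju's algorithm. The graph `toy_edges` and its token names are hypothetical, and how the paper builds its token graphs from training data is not described in this summary, so this only illustrates the graph concept, not the authors' construction.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Return the SCCs of a directed graph as a list of frozensets (Kosaraju)."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(frozenset(component))
    return components


# Hypothetical toy graph: "a" and "b" point at each other, "c" is only reachable.
toy_nodes = ["a", "b", "c"]
toy_edges = [("a", "b"), ("b", "a"), ("b", "c")]
components = strongly_connected_components(toy_nodes, toy_edges)
```

In this toy graph, "a" and "b" reach each other and so share one component, while "c" forms a component of its own.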

Findings

01. Self-attention learns to retrieve the high-priority tokens associated with the last input token.

02. It forms a convex combination of the retrieved tokens, from which the next token is sampled.

03. Gradient descent discovers the strongly connected components of token graphs.
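
The first two findings can be illustrated with a minimal pure-Python sketch of one attention step in which the last token acts as the query. The parameterization (a single combined weight matrix `W`, one-hot toy embeddings, and a `scale` factor standing in for the growth of the weights during training) is an assumption made for illustration, not the paper's exact setup.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens, W, scale=1.0):
    """One attention step with the last token as the query.

    tokens: list of d-dimensional embeddings (lists of floats).
    W: d x d combined attention weights (hypothetical parameterization).
    Returns the attention weights and the attended output.
    """
    query = tokens[-1]
    d = len(query)
    # Project the query: Wq = scale * W @ query.
    Wq = [scale * sum(W[i][j] * query[j] for j in range(d)) for i in range(d)]
    # One score per input token: <token, Wq>.
    scores = [sum(x[i] * Wq[i] for i in range(d)) for x in tokens]
    weights = softmax(scores)
    # Output is a convex combination of the input embeddings.
    output = [sum(w * x[i] for w, x in zip(weights, tokens)) for i in range(d)]
    return weights, output


# Toy input: three one-hot token embeddings; the last one is the query.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed weights: the first two tokens score equally, the last scores lower.
W = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

soft_weights, soft_output = attend(tokens, W, scale=1.0)
hard_weights, hard_output = attend(tokens, W, scale=50.0)
```

At `scale=1.0` every token keeps noticeable weight. At `scale=50.0` the weight on the low-scoring token is negligible and the two top-scoring tokens share the mass equally, so the output is effectively a convex combination of the selected tokens only.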

Abstract

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then…

