Towards Understanding the Universality of Transformers for Next-Token Prediction
Michael E. Sander, Gabriel Peyré

TL;DR
This paper studies how well causal Transformers can approximate next-token prediction on autoregressive sequences. It focuses on two settings, linear maps and periodic sequences, and combines theoretical proofs with experimental validation.
Contribution
It provides a theoretical analysis of Transformers' ability to learn sequence mappings using causal kernel descent, connecting it to the Kaczmarz algorithm, and validates findings experimentally.
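For readers unfamiliar with the Kaczmarz algorithm mentioned above, the following is a minimal sketch of the classical iteration applied to a toy autoregressive sequence. It is not the paper's Transformer construction: the 2D rotation `W`, the angle `theta`, the sequence length, and the helper names (`matvec`, `kaczmarz_step`, `frobenius_error`) are all assumptions chosen for illustration. Each step projects the current estimate onto the set of matrices consistent with one observed pair of consecutive tokens.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def kaczmarz_step(W_hat, x, y):
    """Project W_hat onto {M : M x = y}, i.e. one Kaczmarz update per row."""
    norm_sq = sum(v * v for v in x)
    residual = [y_i - p_i for y_i, p_i in zip(y, matvec(W_hat, x))]
    return [[w + r * v / norm_sq for w, v in zip(row, x)]
            for row, r in zip(W_hat, residual)]


def frobenius_error(A, B):
    """Frobenius norm of A - B."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


# Hypothetical ground truth: a 2D rotation, so x_{t+1} = W x_t (assumption).
theta = 0.7
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# Generate the autoregressive sequence x_1, ..., x_21.
xs = [[1.0, 0.0]]
for _ in range(20):
    xs.append(matvec(W, xs[-1]))

# Causal pass: visit each past pair (x_s, x_{s+1}) once, in order.
W_hat = [[0.0, 0.0], [0.0, 0.0]]
errors = [frobenius_error(W_hat, W)]
for x, y in zip(xs[:-1], xs[1:]):
    W_hat = kaczmarz_step(W_hat, x, y)
    errors.append(frobenius_error(W_hat, W))

# Next-token estimate from the last observed token, and the true next token.
prediction = matvec(W_hat, xs[-1])
target = matvec(W, xs[-1])
print("final estimation error:", errors[-1])
print("prediction:", prediction, "target:", target)
```

Because every update is an orthogonal projection onto a set containing the true `W`, the estimation error in `errors` never increases from one step to the next. How quickly it shrinks depends on the angles between consecutive tokens, which is one reason the toy example uses a rotation.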
Findings
Transformers can learn specific sequence mappings in-context, namely linear maps and periodic sequences.
The causal kernel descent method effectively estimates next tokens based on past observations.
Experimental results support the theoretical analysis and suggest broader applicability.
Abstract
Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$, and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)_{t \geq 1}$ is periodic. We explicitly construct a Transformer (with linear, exponential, or…
Taxonomy
Topics: Time Series Analysis and Forecasting · Topic Modeling · Computational Physics and Python Applications
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
