Next-token prediction capacity: general upper bounds and a lower bound for transformers
Liam Madden, Curtis Fox, Christos Thrampoulidis

TL;DR
This paper proves upper and lower bounds, equal up to a multiplicative constant, on the number of distinct context sequences for which a decoder-only transformer can interpolate next-token distributions.
Contribution
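It introduces the first bounds on the number of distinct context sequences whose next-token distributions a decoder-only transformer can interpolate. The lower bounds are proved for one-layer multi-head decoder-only transformers, and the proofs highlight an injectivity property of self-attention.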
Findings
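The upper and lower bounds on the number of interpolable context sequences are equal up to a multiplicative constant.
The bounds hold both when next-token distributions are arbitrary and when they are calculated from a finite number of document sequences.
Numerical evidence suggests that the minimal number of parameters needed for memorization is enough to train the model to the entropy lower bound.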
Abstract
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers and our proofs highlight an important injectivity property satisfied by…
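To make the empirical setting in the abstract concrete, here is a minimal sketch of how next-token distributions can be calculated from a finite set of documents, and of the entropy lower bound on average cross-entropy loss that a model reaches when it interpolates those distributions exactly. It is an illustration under stated assumptions, not the authors' code: a context is taken to be a full non-empty prefix of a document, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_counts(documents):
    """Count next tokens for every distinct context.

    Assumption: a context is a full non-empty prefix of a document.
    """
    counts = defaultdict(Counter)
    for doc in documents:
        for t in range(1, len(doc)):
            counts[tuple(doc[:t])][doc[t]] += 1
    return counts


def next_token_distributions(counts):
    """Normalize the counts into empirical conditional distributions."""
    dists = {}
    for ctx, ctr in counts.items():
        n = sum(ctr.values())
        dists[ctx] = {tok: c / n for tok, c in ctr.items()}
    return dists


def entropy_lower_bound(counts):
    """Smallest achievable average cross-entropy (in nats) on the data.

    It is attained exactly when the predicted distribution equals the
    empirical next-token distribution for every context.
    """
    total = sum(sum(ctr.values()) for ctr in counts.values())
    loss = 0.0
    for ctr in counts.values():
        n = sum(ctr.values())
        for c in ctr.values():
            loss -= c * math.log(c / n)
    return loss / total


# Hypothetical toy corpus of three short documents.
documents = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

counts = context_counts(documents)
dists = next_token_distributions(counts)
bound = entropy_lower_bound(counts)

print(len(dists), "distinct contexts")
print("entropy lower bound (nats):", bound)
```

In this toy corpus the number of distinct contexts is the quantity that the paper's capacity bounds count, and the gap between a trained model's loss and `bound` measures how far the model is from interpolating the empirical distributions.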
Taxonomy
Topics: Advancements in Photolithography Techniques · Semiconductor materials and devices · Machine Learning in Materials Science
