Next-token prediction capacity: general upper bounds and a lower bound for transformers

Liam Madden; Curtis Fox; Christos Thrampoulidis

arXiv:2405.13718 · cs.LG · November 25, 2025


TL;DR

This paper proves upper and lower bounds, equal up to a multiplicative constant, on the largest number of distinct context sequences for which a decoder-only transformer can interpolate next-token distributions.

Contribution

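The paper gives the first bounds on the number of distinct context sequences whose next-token distributions a decoder-only transformer can interpolate. The lower bounds hold for one-layer multi-head decoder-only transformers, and the proofs highlight an injectivity property of self-attention.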

Findings

1. Bounds are equal up to a constant factor.

2. Minimal parameters for memorization match entropy lower bound.

3. Numerical evidence supports the theoretical capacity limits.

Abstract

Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers and our proofs highlight an important injectivity property satisfied by…
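The empirical setting mentioned in the abstract can be illustrated with a minimal Python sketch that counts the distinct context sequences in a small corpus and computes the next-token distribution for each. The function name `empirical_next_token_distributions`, the toy documents, and the choice to treat every non-empty proper prefix of a document as a context are assumptions made for illustration; the paper's exact definitions may differ.

```python
from collections import Counter, defaultdict


def empirical_next_token_distributions(documents):
    """Map each distinct context to the empirical distribution of its next token.

    Assumption (illustrative): a context is any non-empty proper prefix of a
    document, and the next token is the token that follows that prefix.
    """
    counts = defaultdict(Counter)
    for doc in documents:
        for t in range(1, len(doc)):
            context = tuple(doc[:t])      # tokens seen so far
            counts[context][doc[t]] += 1  # token that follows them

    distributions = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        distributions[context] = {tok: c / total for tok, c in counter.items()}
    return distributions


# Hypothetical toy corpus of three document sequences.
documents = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

dists = empirical_next_token_distributions(documents)
num_contexts = len(dists)  # number of distinct contexts a model would need to interpolate

for context, dist in dists.items():
    print(context, dist)
```

In this sketch, `num_contexts` is the quantity the bounds concern: the number of distinct contexts whose next-token distributions must be matched exactly.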


Code & Models

Repositories

curtfox/decoder-memory-capacity (PyTorch, official)

