Provable Long-Range Benefits of Next-Token Prediction

Xinyuan Cao; Santosh S. Vempala

arXiv:2512.07818·cs.LG·December 9, 2025

Open Access

TL;DR

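This paper proves that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model whose generated $k$-token continuations cannot be told apart from real ones by any algorithm of bounded description length, offering a theoretical account of why language models produce coherent long-range text.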

Contribution

It gives a formal proof that next-token prediction lets common neural network architectures capture long-range structure, with bounds on model size that are polynomial in the window length $k$ and independent of the document length.

Findings

01

An RNN trained by next-token prediction closely approximates the training distribution over windows of $k$ consecutive tokens, for any $k$.

02

The model size needed for this $k$-token indistinguishability is polynomial in $k$ and independent of the document length.

03

The results offer a theoretical explanation for why language models capture long-range dependencies.

Abstract

Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token…
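The $k$-token indistinguishability notion in the abstract can be illustrated with a small simulation. The sketch below is not from the paper: the two-token "sticky" source, the uniform baseline, the single hand-written distinguisher `all_equal`, and every function name are hypothetical choices made only for illustration. The paper's statement covers every distinguisher of bounded description length, whereas this toy estimates the advantage of just one.

```python
import random


def true_model(context):
    # Hypothetical "sticky" source over tokens {0, 1}:
    # repeat the last token with probability 0.9.
    last = context[-1]
    return {last: 0.9, 1 - last: 0.1}


def uniform_model(context):
    # Hypothetical poor language model: ignores the context entirely.
    return {0: 0.5, 1: 0.5}


def sample_continuation(next_token_probs, prefix, k, rng):
    """Sample k tokens autoregressively, one next-token draw at a time."""
    out = list(prefix)
    for _ in range(k):
        probs = next_token_probs(tuple(out))
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out[len(prefix):]


def all_equal(prefix, continuation):
    # One fixed distinguisher: outputs 1 if the k tokens are all identical.
    return int(len(set(continuation)) == 1)


def distinguisher_advantage(source, model, distinguisher, prefix, k,
                            n_samples=2000, seed=0):
    """Monte Carlo estimate of |E[d(real)] - E[d(generated)]| for one prefix."""
    rng = random.Random(seed)
    real = sum(distinguisher(prefix, sample_continuation(source, prefix, k, rng))
               for _ in range(n_samples))
    fake = sum(distinguisher(prefix, sample_continuation(model, prefix, k, rng))
               for _ in range(n_samples))
    return abs(real - fake) / n_samples


if __name__ == "__main__":
    prefix, k = [0], 4
    print("perfect model:", distinguisher_advantage(true_model, true_model, all_equal, prefix, k))
    print("uniform model:", distinguisher_advantage(true_model, uniform_model, all_equal, prefix, k))
```

For this toy, a model that matches the source should give an estimated advantage near zero (only sampling noise remains), while the uniform model should be easy to separate: the probability that all four tokens agree is $0.9^3 = 0.729$ under the sticky source and $0.5^3 = 0.125$ under the uniform model, an expected gap of about 0.6.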

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science