Provable Long-Range Benefits of Next-Token Prediction
Xinyuan Cao, Santosh S. Vempala

TL;DR
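This paper proves that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model whose continuations of a held-out prefix cannot be told apart from the true text, over any window of k consecutive tokens, by any algorithm of bounded description length. This offers a theoretical account of why language models trained on next-token prediction generate long-range coherent text.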
Contribution
It gives a formal proof that next-token prediction lets an RNN capture long-range structure, together with polynomial bounds on the model size needed for k-token indistinguishability, where k is the number of consecutive tokens a distinguisher may examine.
Findings
An RNN trained by next-token prediction closely approximates the training distribution over windows of k consecutive tokens, for any k.
The model size needed for this guarantee is polynomial in k and independent of the document length.
The result gives a theoretical explanation for why language models capture long-range dependencies.
Abstract
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next k tokens, for any k, can distinguish between k consecutive tokens of such documents and k tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in k, independent of the document length) on the model size needed to achieve such k-token…
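One way to read the indistinguishability statement in the abstract is sketched below. The notation is an assumption made for illustration and is not taken from the paper: p is the training distribution, q is the learned model, x_{1:i} is a prefix of a held-out document, D is a distinguisher of bounded description length that outputs a value in [0, 1], and epsilon is the allowed advantage. Whether D also receives the prefix, and how the bound on its description length is stated, are left open here.

```latex
% Sketch of k-token indistinguishability (notation assumed, not the paper's).
% Left term:  D is shown the true next k tokens of a held-out document.
% Right term: D is shown k tokens sampled from the model q after the same prefix.
\[
\Bigl|\,
  \mathbb{E}_{x \sim p}\bigl[ D(x_{i+1:i+k}) \bigr]
  \;-\;
  \mathbb{E}_{\,x_{1:i} \sim p,\;\hat{x}_{i+1:i+k} \sim q(\cdot \mid x_{1:i})}
    \bigl[ D(\hat{x}_{i+1:i+k}) \bigr]
\,\Bigr|
\;\le\; \epsilon
\]
```

Under this reading, the abstract's claim is that such a bound holds for every admissible D and every window length k, with a model whose size grows polynomially in k.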
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science
