Auto-Regressive Next-Token Predictors are Universal Learners
Eran Malach

TL;DR
This paper presents a theoretical framework showing that simple auto-regressive next-token predictors, like linear models, can approximate complex functions and exhibit advanced reasoning abilities, highlighting the fundamental power of the next-token training scheme.
Contribution
The work introduces a new complexity measure called length complexity and demonstrates that simple models trained on next-token prediction can perform complex reasoning tasks.
Findings
Linear next-token predictors, trained on Chain-of-Thought data, can approximate any function efficiently computed by a Turing machine.
Simple models show non-trivial performance on reasoning and arithmetic tasks.
The power of large language models largely stems from the auto-regressive training scheme.
Abstract
Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer…
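The role of intermediate tokens can be made concrete with a toy example. The sketch below is an illustration written for this summary, not the paper's construction or experiment. It assumes a parity task, a ±1 token encoding, and a hypothetical fixed feature map over the current input bit and the previous chain-of-thought token. Under those assumptions each intermediate token (the running parity) is a linear threshold function of the features, so a perceptron can learn the next-token rule, and the chain uses one intermediate token per input bit.

```python
import itertools


def encode(bit):
    # Map bit 0 -> +1 and bit 1 -> -1, so that XOR becomes multiplication.
    return 1 - 2 * bit


def features(x_i, z_prev):
    # Hypothetical feature map (an assumption of this sketch): bias, the
    # current input token, the previous CoT token, and their product.
    return [1.0, float(x_i), float(z_prev), float(x_i * z_prev)]


def dot(w, phi):
    return sum(wi * fi for wi, fi in zip(w, phi))


def cot_examples(bits):
    # One (features, target) pair per intermediate token of a CoT sequence.
    # The i-th intermediate token is the running parity of bits[0..i].
    z_prev = 1  # encoding of running parity 0
    for b in bits:
        x_i = encode(b)
        z_i = x_i * z_prev
        yield features(x_i, z_prev), z_i
        z_prev = z_i


def train_perceptron(n_bits=4, epochs=10):
    # Next-token training: every intermediate token is a supervised target.
    data = [ex
            for bits in itertools.product([0, 1], repeat=n_bits)
            for ex in cot_examples(bits)]
    w = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        for phi, y in data:
            if y * dot(w, phi) <= 0:  # mistake: standard perceptron update
                w = [wi + y * fi for wi, fi in zip(w, phi)]
    return w


def predict_parity(w, bits):
    # Auto-regressive generation: each predicted token is fed back as context.
    z_prev = 1
    steps = 0
    for b in bits:
        z_prev = 1 if dot(w, features(encode(b), z_prev)) > 0 else -1
        steps += 1
    parity = 0 if z_prev == 1 else 1
    return parity, steps  # steps = number of intermediate tokens generated


w = train_perceptron()
print(predict_parity(w, [1, 0, 1, 1, 0, 1, 1]))
```

In this toy setting the learned rule is trained on 4-bit sequences but applies to inputs of any length, because every step sees one of only four possible feature vectors. The number of generated intermediate tokens grows linearly with the input length, which is the kind of quantity the length-complexity measure is meant to track.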
Peer Reviews
Decision: ICML 2024 Poster
- Provides an interesting angle on the success of chain-of-thought in enabling LMs to perform more complex tasks
- Provides both theoretical analysis and empirical evidence
- The TinyStories experiment (Section 3.1) lacks quantitative evaluation. It is unclear if the examples shown are representative.
- Multiplication experiment: unlike the TinyStories experiment, the model is not linear -- is this important? Why not use the same model for both experiments?
- There appears to be a potential mismatch with realistic chain-of-thought prompting in that the learnability theory developed here assumes that the tasks are available together with their full sequences of in
(1) Provides an elegant theoretical framework for studying auto-regressive next-token prediction models, an important class of models in NLP. (2) Establishes strong learnability and approximation guarantees for simple models like linear predictors when trained auto-regressively. (3) Introduces the novel concept of "length complexity" to capture chain-of-thought requirements. Relates length complexity to sample and computational complexity.
(1) The theoretical results rely on very strong assumptions about availability of chain-of-thought training data, which may be unrealistic. (2) More analysis would be useful on how length complexity scales with problem complexity for different hypothesis classes. (3) Additional validation on more complex architectures like Transformers would strengthen the conclusions about training scheme vs architecture. (4) The proposed linear models are not exactly equivalent to the classical linear model
- This paper provides a theoretical foundation for an increasingly important topic, namely, understanding the emergent abilities of autoregressive learners.
- Several theoretical works have already tackled the questions of autoregressive learning. However, to my knowledge, this is the first paper to propose a generic definition analogous to PAC-learning. This is significant because it may inspire learning theorists to search for analogous results to what is known in the rich literature of PAC. Fo
Listed in decreasing order of significance.
## TinyStories experiment is anecdotal, compares to wrong model?
I am not sure what are the actual results being reported with the TinyStories experiment: What I found is a footnote on a 1.2 difference in perplexity between the linear predictor and GPT-2 Small, but I'm not sure what to make of this quantity. And there is a statement that the linear predictor "often does produce coherent text". But how often? And how do you measure coherence? While I
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
