The pitfalls of next-token prediction
Gregor Bachmann, Vaishnavh Nagarajan

TL;DR
This paper critically examines the limitations of next-token prediction, showing how teacher-forcing can fail to learn an accurate next-token predictor and proposing a teacherless training approach to overcome this failure, with implications for whether such models can faithfully model human intelligence.
Contribution
It clarifies the distinction between inference and training in next-token models, exposes failure modes of teacher-forcing, and introduces a teacherless training method to address these failures.
Findings
In certain classes of tasks, teacher-forcing can fail to learn an accurate next-token predictor in the first place.
Both the Transformer and the Mamba architecture empirically fail on a minimal planning task designed by the authors.
Teacherless training with dummy tokens can mitigate the identified failures.
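The difference between the two training schemes can be sketched on plain token lists. This is a minimal illustration under stated assumptions, not the authors' code: the dummy token "$", the helper names, and the toy sequences are all hypothetical. Under teacher-forcing, the input at each answer position contains the ground-truth previous answer tokens; in the teacherless variant sketched here, those inputs are replaced by dummy tokens, so the answer must be predicted from the prefix alone.

```python
DUMMY = "$"  # hypothetical placeholder token (an assumption for this sketch)


def teacher_forced_example(prefix, answer):
    """Build (inputs, targets) where inputs reveal the ground-truth answer prefix."""
    # Input at position i is used to predict the token at position i + 1.
    inputs = prefix + answer[:-1]
    # Only answer tokens are scored; prefix positions carry no target (None).
    targets = [None] * (len(prefix) - 1) + answer
    return inputs, targets


def teacherless_example(prefix, answer):
    """Build (inputs, targets) where answer inputs are hidden behind dummy tokens."""
    inputs = prefix + [DUMMY] * (len(answer) - 1)
    targets = [None] * (len(prefix) - 1) + answer
    return inputs, targets


if __name__ == "__main__":
    prefix = ["a", "b", "="]
    answer = ["x", "y", "z"]
    print(teacher_forced_example(prefix, answer))
    print(teacherless_example(prefix, answer))
```

Both helpers produce the same targets; only the inputs differ. In the teacherless case the model cannot lean on previously revealed answer tokens when predicting later ones.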
Abstract
Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically…
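The error-compounding criticism mentioned in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each token is predicted correctly with the same probability 1 - eps, independently of the others; this independence assumption and the function name are not taken from the paper.

```python
def sequence_accuracy(eps: float, length: int) -> float:
    """Probability that all `length` tokens are correct, assuming each token
    is wrong independently with probability `eps` (an illustrative assumption)."""
    return (1.0 - eps) ** length


if __name__ == "__main__":
    # Even a small per-token error rate erodes whole-sequence accuracy.
    for length in (10, 100, 1000):
        print(length, sequence_accuracy(0.01, length))
```

Under this simplified model, a 1% per-token error rate leaves roughly a 37% chance of a fully correct 100-token sequence. Note that the calculation presupposes an accurate next-token predictor, which is exactly the assumption the abstract calls into question.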
Taxonomy
Topics: Scientific Computing and Data Management
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
