The pitfalls of next-token prediction

Gregor Bachmann; Vaishnavh Nagarajan

arXiv:2403.06963 · cs.CL · July 30, 2025 · 2 cites

TL;DR

This paper examines the limitations of next-token prediction. It shows that teacher-forced training can fail to learn an accurate next-token predictor on certain tasks, and it proposes a teacherless training approach to overcome this failure, with implications for whether such models can faithfully model human intelligence.

Contribution

The paper separates the two phases of next-token prediction, autoregressive inference and teacher-forced training, exposes failure modes of teacher-forcing, and introduces a teacherless training method to address these failures.

Findings

01. Teacher-forcing can fail to learn an accurate next-token predictor in certain classes of tasks.

02. Both the Transformer and the Mamba architecture empirically fail on a minimal planning task designed by the authors.

03. Teacherless training with dummy tokens can mitigate the identified failures.
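To make the third finding concrete, the sketch below contrasts how one training example might be laid out under teacher-forcing and under teacherless training. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_training_pair`, the `"$"` dummy symbol, and the convention of marking unsupervised positions with `None` are all hypothetical. The idea it assumes is that teacherless training keeps the same targets but replaces the ground-truth answer tokens in the input with dummy tokens, so the model cannot read earlier answer tokens while predicting later ones.

```python
def build_training_pair(prefix, target, teacherless=False, dummy="$"):
    """Return (inputs, labels) for one sequence; inputs[i] is used to predict labels[i].

    Hypothetical helper for illustration only. `dummy` is an assumed
    placeholder symbol, not a token taken from the paper's code.
    """
    if teacherless:
        # The model sees dummy tokens where the true answer tokens would be.
        answer_inputs = [dummy] * (len(target) - 1)
    else:
        # Teacher-forcing: the model sees the true previous answer tokens.
        answer_inputs = list(target[:-1])

    inputs = list(prefix) + answer_inputs
    # No loss on prefix positions, except the last one, which predicts target[0].
    labels = [None] * (len(prefix) - 1) + list(target)
    return inputs, labels


prefix = ["a", "b", "="]
target = ["x", "y", "z"]

tf_inputs, tf_labels = build_training_pair(prefix, target)
tl_inputs, tl_labels = build_training_pair(prefix, target, teacherless=True)

print(tf_inputs)  # ['a', 'b', '=', 'x', 'y']
print(tl_inputs)  # ['a', 'b', '=', '$', '$']
print(tf_labels == tl_labels)  # True: only the inputs differ
```

In both modes the labels are identical; the only change is what the model is allowed to condition on at the answer positions.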

Abstract

Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically…
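The two phases the abstract distinguishes can be sketched with a toy predictor. This is a minimal illustration, not the paper's experiment: `toy_predict` is a hypothetical stand-in for a learned model, and the counting task is invented for the example. It shows only the familiar error-compounding criticism of autoregressive inference. It does not reproduce the deeper failure the authors describe, where teacher-forced training fails to learn an accurate predictor in the first place.

```python
def teacher_forced_predictions(predict, prefix, target):
    """Predict each target token from the prefix plus the TRUE previous tokens."""
    return [predict(prefix + target[:i]) for i in range(len(target))]


def autoregressive_generate(predict, prefix, length):
    """Predict each token from the prefix plus the model's OWN previous outputs."""
    out = []
    for _ in range(length):
        out.append(predict(prefix + out))
    return out


def toy_predict(context):
    """Hypothetical predictor: counts up by one, but makes one mistake after a 3."""
    last = context[-1]
    return last + 2 if last == 3 else last + 1


prefix, target = [1], [2, 3, 4, 5, 6]

tf = teacher_forced_predictions(toy_predict, prefix, target)
ar = autoregressive_generate(toy_predict, prefix, len(target))

print(tf)  # [2, 3, 5, 5, 6] -> one wrong token; later steps are fed the true prefix
print(ar)  # [2, 3, 5, 6, 7] -> the same mistake shifts every later token
```

Under teacher-forcing the single mistake stays local because each later step conditions on the ground truth. During autoregressive generation the same mistake enters the context and every subsequent token is off.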

Code & Models

Repositories

gregorbachmann/next-token-failures (PyTorch, official)

Videos

The Pitfalls of Next-token Prediction · YouTube

The Pitfalls of Next-Token Prediction · SlidesLive

Taxonomy

Topics: Scientific Computing and Data Management

Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing