Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

arXiv:2502.12465 · cs.LG · February 19, 2025

TL;DR

This paper studies the fundamental limits of next-token prediction under model misspecification and characterizes the computational and statistical tradeoffs that govern how much error amplification is unavoidable as the sequence length increases.

Contribution

The paper gives a theoretical analysis of error amplification under misspecification, establishing bounds on the approximation factor and tradeoffs between computation and accuracy for robust next-token prediction.

Findings

1. Under misspecification, the approximation factor $C$ of next-token prediction grows with the sequence length $H$, so errors amplify on longer sequences.

2. Information-theoretically (ignoring computational cost), error amplification can be avoided and a constant approximation factor is achievable.

3. Computational constraints impose a polynomial or sub-exponential barrier on achievable accuracy.

Abstract

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C \geq 1$ -- we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically,…
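
To make the role of the approximation factor concrete, the misspecified goal described in the abstract can be written schematically as follows. This is a sketch, not the paper's exact statement: the divergence $D$, the model class $\Pi$, the target $\pi^\star$, the learned model $\hat{\pi}$, and the statistical error term $\varepsilon_{\mathrm{stat}}$ are notation assumed here for illustration.

$$
D(\hat{\pi}, \pi^\star) \;\le\; C \cdot \min_{\pi \in \Pi} D(\pi, \pi^\star) + \varepsilon_{\mathrm{stat}}, \qquad C \ge 1.
$$

In this notation, the well-specified case has $\min_{\pi \in \Pi} D(\pi, \pi^\star) = 0$, so $C$ plays no role. Error amplification under misspecification corresponds to the smallest achievable $C$ growing with the sequence length $H$.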

Taxonomy

Topics: Natural Language Processing Techniques