Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification
Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

TL;DR
This paper investigates the fundamental limits of next-token prediction under model misspecification, revealing inherent computational and statistical tradeoffs that affect error amplification as sequence length increases.
Contribution
It provides a theoretical analysis of error amplification in misspecified models, establishing bounds and tradeoffs for achieving robust next-token prediction.
Findings
Error amplification grows with sequence length under misspecification.
Information-theoretic methods can avoid error amplification, achieving constant approximation.
Computational constraints impose a polynomial or sub-exponential barrier on achievable accuracy.
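To make the first finding concrete, here is a minimal toy sketch (not the paper's construction, and not a model of misspecification itself). It assumes tokens are i.i.d. binary draws and that a model's per-token distribution is off by a small fixed amount; the function names and the values 0.50 and 0.55 are hypothetical. It uses the standard fact that the Bhattacharyya coefficient of a product distribution is the product of the per-coordinate coefficients, with squared Hellinger distance taken as 1 minus that coefficient.

```python
import math


def per_token_bc(p: float, q: float) -> float:
    """Bhattacharyya coefficient between Bernoulli(p) and Bernoulli(q)."""
    return math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))


def sequence_hellinger_sq(p: float, q: float, horizon: int) -> float:
    """Squared Hellinger distance between two length-`horizon` sequences of
    i.i.d. tokens, one drawn from Bernoulli(p) and the other from Bernoulli(q).

    Uses the convention H^2 = 1 - BC; for independent tokens the
    sequence-level BC is the per-token BC raised to the power `horizon`.
    """
    return 1.0 - per_token_bc(p, q) ** horizon


if __name__ == "__main__":
    # Hypothetical values: the target emits 1 with probability 0.50,
    # the model with probability 0.55.
    for horizon in (1, 10, 100, 1000):
        print(horizon, sequence_hellinger_sq(0.50, 0.55, horizon))
```

Under these assumptions the sequence-level distance increases with the horizon even though the per-token discrepancy stays fixed, which is the kind of compounding the findings refer to. The sketch says nothing about the approximation factor or the computational barriers.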
Abstract
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor -- we confirm that this approximation factor indeed grows with sequence length for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically,…
Taxonomy
Topics: Natural Language Processing Techniques
