Implicit Optimization Bias of Next-Token Prediction in Linear Models
Christos Thrampoulidis

TL;DR
This paper studies the implicit optimization bias of next-token prediction (NTP) in linear models. It shows that gradient descent selects solutions that match the log-odds of in-support tokens within a data subspace and maximize an NTP-specific margin in the orthogonal subspace, with implications for understanding language model training.
Contribution
It introduces NTP-separability conditions and characterizes the implicit bias of gradient descent in linear models for next-token prediction, extending prior bias analyses to this setting.
Findings
Within the data subspace, GD equates the logit differences of in-support tokens to their log-odds
In the orthogonal subspace, the GD iterates diverge in norm along the direction that maximizes the NTP margin
The results extend implicit-bias analyses from one-hot classification to NTP
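The first finding can be written out as a short worked statement. The notation here is an assumption made for illustration, not taken from the paper: contexts are indexed by j with fixed embeddings h_j and empirical frequencies \hat\pi_j, \hat p_{j,z} is the conditional probability of token z after context j, S_j is the set of tokens with \hat p_{j,z} > 0, W is the linear decoder, and e_z is the z-th standard basis vector. The first line is the NTP cross-entropy together with its entropy lower bound (which follows from Gibbs' inequality); the second is the log-odds matching condition on in-support tokens.

```latex
\begin{align*}
\mathrm{CE}(W)
  &= \sum_{j} \hat\pi_j \sum_{z \in S_j} \hat p_{j,z}\,
     \bigl(-\log \mathrm{softmax}_z(W h_j)\bigr)
  \;\ge\; \sum_{j} \hat\pi_j \sum_{z \in S_j} \hat p_{j,z}
     \log \frac{1}{\hat p_{j,z}}, \\
(e_z - e_{z'})^\top W h_j
  &= \log \frac{\hat p_{j,z}}{\hat p_{j,z'}}
  \qquad \text{for all } z, z' \in S_j .
\end{align*}
```

For finite W the bound is strict whenever some context has an out-of-support token, since the softmax assigns that token positive probability. This is consistent with the second finding that the iterates must grow in norm to approach the bound.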
Abstract
We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied with a sparse conditional probability distribution across a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits' differences of in-support tokens to their log-odds.…
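The setup in the abstract can be sketched on a toy problem. What follows is a minimal illustration under stated assumptions, not the paper's experiments: the three contexts, the four-token vocabulary, the probability table `P`, the uniform context frequencies `pi`, the orthonormal embeddings `H`, the step size, and the iteration count are all hypothetical choices. Orthonormal embeddings make every logit pattern realizable, so the entropy lower bound can be approached in this toy case.

```python
import numpy as np

# Hypothetical toy data: 3 distinct contexts, vocabulary of 4 tokens.
# Each row of P is a sparse next-token distribution for one context.
P = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.00, 0.20, 0.80, 0.00],
    [0.25, 0.00, 0.00, 0.75],
])
pi = np.full(3, 1.0 / 3.0)  # context frequencies (assumed uniform)
H = np.eye(3)               # fixed context embeddings (assumed orthonormal)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def ntp_loss(W):
    """Cross-entropy over distinct contexts, weighted by pi."""
    logp = log_softmax(H @ W.T)  # shape (contexts, vocab)
    return -(pi[:, None] * P * logp).sum()


def ntp_grad(W):
    """Gradient of ntp_loss with respect to W, shape (vocab, dim)."""
    S = np.exp(log_softmax(H @ W.T))
    return ((S - P) * pi[:, None]).T @ H


# Entropy lower bound; zero-probability entries contribute 0.
safe_P = np.where(P > 0, P, 1.0)
entropy = -(pi[:, None] * P * np.log(safe_P)).sum()

# Plain gradient descent from zero initialization.
W = np.zeros((4, 3))
lr = 1.0
for _ in range(5000):
    W -= lr * ntp_grad(W)

logits = H @ W.T
gap = ntp_loss(W) - entropy

# In-support tokens of context 1 are tokens 1 and 2.
learned_log_odds = logits[1, 2] - logits[1, 1]
target_log_odds = np.log(P[1, 2] / P[1, 1])

# Largest probability the model still assigns to any out-of-support token.
max_off_support_prob = np.exp(log_softmax(logits))[P == 0].max()

print("loss minus entropy:", gap)
print("learned vs. target log-odds:", learned_log_odds, target_log_odds)
print("max out-of-support probability:", max_off_support_prob)
print("parameter norm:", np.linalg.norm(W))
```

In this sketch the loss stays above the entropy bound while approaching it, the in-support logit difference settles near the log-odds, and the out-of-support probabilities shrink as the parameter norm keeps growing. Because the embeddings are orthonormal, the example does not exercise the general NTP-separability conditions or the margin direction the paper characterizes.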
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)
