Implicit Optimization Bias of Next-Token Prediction in Linear Models

Christos Thrampoulidis

arXiv:2402.18551 · cs.LG · November 1, 2024 · 1 citation

Open Access

TL;DR

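This paper studies the implicit optimization bias of next-token prediction (NTP) in linear models. It shows that gradient descent selects solutions whose in-support logit differences match the data's log-odds within the data subspace, while maximizing an NTP-specific margin in the orthogonal subspace. The results bear on how language model training chooses among the many minimizers of the NTP objective.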

Contribution

It introduces NTP-separability conditions and characterizes the implicit bias of gradient descent in linear models for next-token prediction, extending prior bias analyses to this setting.

Findings

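1. Within the data subspace, GD aligns the logit differences of in-support tokens with their log-odds.
2. In the orthogonal subspace, the GD parameters diverge in norm, along a direction that maximizes the NTP margin.
3. The results extend the understanding of implicit bias from one-hot classification to NTP.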
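The first two findings can be illustrated with a small numerical sketch. The pure-Python snippet below runs plain gradient descent on a linear model with fixed context embeddings and checks that the logit difference of the in-support tokens approaches their log-odds while the parameter norm keeps growing. The toy data (three tokens, two orthonormal context embeddings, the chosen probabilities), the step size, and the helper names `gd` and `frob_norm` are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical toy data (not from the paper): 3 tokens, 2 distinct contexts.
contexts = [[1.0, 0.0], [0.0, 1.0]]                # fixed context embeddings
priors = [0.5, 0.5]                                # context frequencies
cond_probs = [[0.7, 0.3, 0.0], [0.0, 0.25, 0.75]]  # sparse next-token distributions

def gd(steps, lr=1.0):
    """Gradient descent on the NTP cross-entropy, starting from W = 0."""
    V, d = 3, 2
    W = [[0.0] * d for _ in range(V)]
    for _ in range(steps):
        grad = [[0.0] * d for _ in range(V)]
        for h, pi, p in zip(contexts, priors, cond_probs):
            logits = [sum(w * x for w, x in zip(row, h)) for row in W]
            q = softmax(logits)
            for z in range(V):
                for k in range(d):
                    # Gradient of the cross-entropy w.r.t. W[z][k]: pi * (q_z - p_z) * h_k
                    grad[z][k] += pi * (q[z] - p[z]) * h[k]
        for z in range(V):
            for k in range(d):
                W[z][k] -= lr * grad[z][k]
    return W

def frob_norm(W):
    return math.sqrt(sum(w * w for row in W for w in row))

W_early, W_late = gd(200), gd(2000)

# With h = [1, 0], the logits of the first context are column 0 of W.
logit_diff = W_late[0][0] - W_late[1][0]   # in-support tokens 0 and 1
log_odds = math.log(0.7 / 0.3)

print("logit difference:", logit_diff, "log-odds:", log_odds)
print("norm after 200 steps:", frob_norm(W_early))
print("norm after 2000 steps:", frob_norm(W_late))
```

In this toy setup the out-of-support logit keeps decreasing, so the norm of `W` grows without bound, while the difference between the two in-support logits settles near `log(0.7 / 0.3)`.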

Abstract

We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied with a sparse conditional probability distribution across a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits' differences of in-support tokens to their log-odds.…
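As a minimal sketch of the framing described in the abstract, the following pure-Python snippet evaluates the NTP cross-entropy of a linear model over distinct contexts and compares it with the data-entropy lower bound. The toy contexts, frequencies, and sparse conditional distributions, as well as the function names `ntp_loss` and `data_entropy`, are illustrative assumptions and do not come from the paper.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ntp_loss(W, contexts, priors, cond_probs):
    """Cross-entropy summed over distinct contexts.

    W          : V x d weight matrix (list of rows), one row per token
    contexts   : m fixed context embeddings of dimension d
    priors     : m context frequencies (summing to 1)
    cond_probs : m sparse next-token distributions over the V tokens
    """
    total = 0.0
    for h, pi, p in zip(contexts, priors, cond_probs):
        logits = [sum(w * x for w, x in zip(row, h)) for row in W]
        q = softmax(logits)
        # Only in-support tokens (p_z > 0) contribute to the loss.
        total += pi * sum(-p_z * math.log(q_z) for p_z, q_z in zip(p, q) if p_z > 0)
    return total

def data_entropy(priors, cond_probs):
    """Conditional entropy of the data, the lower bound on the NTP loss."""
    return sum(
        pi * sum(-p_z * math.log(p_z) for p_z in p if p_z > 0)
        for pi, p in zip(priors, cond_probs)
    )

# Hypothetical toy data: 3 tokens, 2 contexts, each with a 2-token support.
contexts = [[1.0, 0.0], [0.0, 1.0]]
priors = [0.5, 0.5]
cond_probs = [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75]]

W_zero = [[0.0, 0.0] for _ in range(3)]
loss_at_zero = ntp_loss(W_zero, contexts, priors, cond_probs)
entropy = data_entropy(priors, cond_probs)

print("loss at W = 0:", loss_at_zero)   # equals log(3): uniform prediction over 3 tokens
print("data entropy:", entropy)
```

Because each conditional distribution is sparse, a finite `W` always leaves some probability on out-of-support tokens, so the loss stays strictly above the entropy for this toy data and can only approach it in the limit.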

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)