Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points

Aditya Varre; Gizem Yüce; Nicolas Flammarion

arXiv:2508.12837 · cs.LG · August 20, 2025

Open Access

TL;DR

This paper analyzes the loss landscape of transformer models trained on in-context n-gram prediction, revealing that sub-n-grams are near-stationary points and explaining stage-wise learning phenomena.

Contribution

The paper provides a theoretical framework showing that sub-n-grams are near-stationary points of the loss landscape, which offers an explanation for stage-wise learning and phase transitions in transformers.

Findings

01

Sub-n-grams are near-stationary points of the population loss.

02

Stage-wise learning dynamics are explained by transitions between these points.

03

Numerical experiments support the theoretical insights.

Abstract

Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning…
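
To make the notion of an in-context $k$-gram estimator concrete, here is a minimal Python sketch of a counting-based version. It is an illustration only, not the paper's transformer construction: the function name `in_context_kgram`, the convention that a $k$-gram estimator conditions on the last $k-1$ tokens, and the uniform fallback when the context never appeared earlier are all assumptions made for this example.

```python
from collections import Counter


def in_context_kgram(tokens, k, vocab_size):
    """Estimate the next-token distribution from the sequence itself.

    Assumed convention: a k-gram estimator conditions on the last k-1
    tokens, so k=1 reduces to in-context unigram frequencies.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ctx_len = k - 1
    if ctx_len > len(tokens):
        raise ValueError("sequence is shorter than the context length")

    # The context to match: the final k-1 tokens (empty tuple when k=1).
    context = tuple(tokens[len(tokens) - ctx_len:])

    # Count which token followed each earlier occurrence of that context.
    counts = Counter()
    for i in range(ctx_len, len(tokens)):
        if tuple(tokens[i - ctx_len:i]) == context:
            counts[tokens[i]] += 1

    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to uniform when the context is unseen.
        return [1.0 / vocab_size] * vocab_size
    return [counts[v] / total for v in range(vocab_size)]


sequence = [0, 1, 0, 1, 0, 2, 0]
unigram = in_context_kgram(sequence, k=1, vocab_size=3)
bigram = in_context_kgram(sequence, k=2, vocab_size=3)
trigram = in_context_kgram(sequence, k=3, vocab_size=3)
```

On this toy sequence the three estimators disagree: the unigram estimate only reflects token frequencies, the bigram estimate uses what followed earlier occurrences of the last token, and the trigram estimate falls back to uniform because the final two-token context has no earlier match. Estimators with $k < n$ of this kind are the "sub-$n$-grams" that the abstract refers to.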

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems