Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points
Aditya Varre, Gizem Yüce, Nicolas Flammarion

TL;DR
This paper analyzes the loss landscape of transformer models trained on in-context n-gram prediction, revealing that sub-n-grams are near-stationary points and explaining stage-wise learning phenomena.
Contribution
It provides a theoretical framework showing sub-n-grams are near-stationary points in the loss landscape, explaining stage-wise learning and phase transitions in transformers.
Findings
Sub-n-grams are near-stationary points of the population loss.
Stage-wise learning dynamics are explained by transitions between these points.
Numerical experiments support the theoretical insights.
Abstract
Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context n-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent k-gram estimators (for k ≤ n), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-n-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning…
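To make the notion of an in-context k-gram estimator concrete, here is a minimal Python sketch. It is an illustration only, not the paper's transformer construction. It assumes the convention that a k-gram estimator conditions on the last k-1 tokens of the context, and it assumes a uniform fallback when that history has not appeared earlier in the context. The function name `in_context_kgram_estimate` and its parameters are hypothetical.

```python
from collections import Counter


def in_context_kgram_estimate(tokens, k, vocab_size):
    """Estimate the next-token distribution from the context alone.

    Assumed convention: a k-gram estimator conditions on the last k-1
    tokens, so k=1 gives in-context unigram frequencies.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    width = k - 1
    # The history we condition on: the final k-1 tokens of the context.
    history = tuple(tokens[len(tokens) - width:]) if width > 0 else ()
    counts = Counter()
    # Count which token followed each earlier occurrence of that history.
    for i in range(width, len(tokens)):
        if tuple(tokens[i - width:i]) == history:
            counts[tokens[i]] += 1
    total = sum(counts.values())
    if total == 0:
        # Assumed fallback: uniform when the history was never followed by a token.
        return [1.0 / vocab_size] * vocab_size
    return [counts[v] / total for v in range(vocab_size)]


context = [0, 1, 0, 1, 0, 2, 0, 1, 0]
unigram = in_context_kgram_estimate(context, k=1, vocab_size=3)
bigram = in_context_kgram_estimate(context, k=2, vocab_size=3)
trigram = in_context_kgram_estimate(context, k=3, vocab_size=3)
```

On this toy context the three estimators disagree: the k=1 estimate favours token 0, while the k=2 and k=3 estimates put no mass on it. Under the abstract's claim, a model trained on n-gram data could linger near the lower-order estimators before reaching the full n-gram solution.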
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
