Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability
Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen

TL;DR
This paper analyzes how language models learn during pre-training. It characterizes patterns of learning, forgetting, and stability for individual tokens in context, and traces the progression from early n-gram learning to more refined predictions.
Contribution
The paper provides a detailed empirical characterization of learning dynamics during language model pre-training, including loss fluctuations and stability within and across five pre-training runs.
Findings
Frequent tokens are learned earlier, reach lower final surprisal, vary less within and across runs, and are less likely to be "forgotten".
Learning involves early n-gram acquisition followed by gradual refinement.
Token loss fluctuations are consistent across pre-training runs.
Abstract
How do language models learn to make predictions during pre-training? To study this, we extract learning curves from five autoregressive English language model pre-training runs, for 1M unseen tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We also find that individual tokens often exhibit sudden increases or decreases in loss that are surprisingly consistent across pre-training runs. To better understand these fluctuations, we quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training.…
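The five per-token quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so every formula below is a simplifying assumption made for illustration, not the authors' method: surprisal is taken as the negative log probability of the token in context, and the function name `curve_metrics`, the `last_k` parameter, and the midpoint rule for age of acquisition are all hypothetical.

```python
import numpy as np

def curve_metrics(surprisal, steps, last_k=3):
    """Summarize learning curves for ONE token in context.

    surprisal: array of shape (n_runs, n_checkpoints), surprisal per run and checkpoint.
    steps:     array of shape (n_checkpoints,), training step of each checkpoint.
    All definitions below are illustrative assumptions, not the paper's.
    """
    surprisal = np.asarray(surprisal, dtype=float)
    steps = np.asarray(steps)
    mean_curve = surprisal.mean(axis=0)  # average curve over runs

    # Final surprisal (assumed): mean over the last `last_k` checkpoints.
    final = mean_curve[-last_k:].mean()

    # Within-run variability (assumed): std over the last `last_k`
    # checkpoints of each run, averaged over runs.
    within = surprisal[:, -last_k:].std(axis=1).mean()

    # Cross-run variability (assumed): std across runs at each
    # checkpoint, averaged over checkpoints.
    cross = surprisal.std(axis=0).mean()

    # Age of acquisition (assumed): first step at which the mean curve
    # drops to the midpoint between its initial and minimum surprisal.
    midpoint = (mean_curve[0] + mean_curve.min()) / 2.0
    first_below = np.nonzero(mean_curve <= midpoint)[0][0]
    aoa = int(steps[first_below])

    # Forgettability (assumed): how far the final surprisal sits above
    # the lowest surprisal reached during training.
    forget = final - mean_curve.min()

    return {
        "final_surprisal": float(final),
        "within_run_variability": float(within),
        "cross_run_variability": float(cross),
        "age_of_acquisition": aoa,
        "forgettability": float(forget),
    }

# Toy example: two runs, five checkpoints (made-up numbers).
example = curve_metrics(
    [[10, 6, 2, 3, 3], [10, 8, 2, 3, 3]],
    steps=[0, 100, 200, 300, 400],
    last_k=2,
)
```

In the toy example the token is "acquired" at step 200 and later drifts one unit above its minimum surprisal, which this sketch reports as nonzero forgettability.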
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
