Training Trajectories of Language Models Across Scales
Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

TL;DR
This paper investigates how language models of various sizes learn during pre-training, revealing that perplexity predicts in-context learning performance better than model size or training steps, and analyzing training dynamics across scales.
Contribution
The study provides a detailed analysis of training trajectories across different model sizes, highlighting perplexity's role in predicting model behavior and learning patterns during pre-training.
Findings
At a given perplexity, a similar subset of training tokens sees the largest loss reduction, regardless of model size.
Early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations.
Perplexity strongly predicts in-context learning performance regardless of model size.
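Since the findings above are all stated in terms of perplexity, a minimal sketch of how that quantity is usually computed may help. It uses the standard definition (the exponential of the mean per-token negative log-likelihood); the function name and the example log-probabilities are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token. This is the standard definition, not code from the paper.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token log-probabilities for two checkpoints on the same text.
early_checkpoint = [math.log(0.05), math.log(0.10), math.log(0.02)]
late_checkpoint = [math.log(0.40), math.log(0.50), math.log(0.25)]

# The later checkpoint assigns higher probability to each token,
# so its perplexity is lower.
print(perplexity(early_checkpoint) > perplexity(late_checkpoint))
```

Comparing checkpoints at matched perplexity, rather than at matched step count or parameter count, is the kind of alignment the findings rely on.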
Abstract
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: OPT
