Training Trajectories of Language Models Across Scales

Mengzhou Xia; Mikel Artetxe; Chunting Zhou; Xi Victoria Lin; Ramakanth Pasunuru; Danqi Chen; Luke Zettlemoyer; Ves Stoyanov

arXiv:2212.09803·cs.CL·May 31, 2023

TL;DR

This paper investigates how language models of various sizes learn during pre-training, revealing that perplexity predicts in-context learning performance better than model size or training steps, and analyzing training dynamics across scales.

Contribution

The study provides a detailed analysis of training trajectories across different model sizes, highlighting perplexity's role in predicting model behavior and learning patterns during pre-training.

Findings

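01 At a given perplexity, a similar subset of training tokens sees the largest reduction in loss regardless of model size; the remaining tokens stagnate or show double-descent behavior.

02 Early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations.

03 Perplexity strongly predicts in-context learning performance regardless of model size.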

Abstract

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal…
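
The abstract compares checkpoints by perplexity and by per-token loss. As a minimal sketch of how those two quantities relate (not the authors' code), the Python below computes perplexity as the exponential of the mean per-token negative log-likelihood and the per-token loss change between two checkpoints. The function names and the sample loss values are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood), with NLLs in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def loss_change_per_token(earlier, later):
    """Per-token loss change between two checkpoints (negative = improved)."""
    if len(earlier) != len(later):
        raise ValueError("both checkpoints must score the same tokens")
    return [b - a for a, b in zip(earlier, later)]


# Hypothetical per-token losses for the same four tokens at two checkpoints.
ckpt_early = [2.0, 1.5, 3.0, 0.5]
ckpt_late = [1.0, 1.5, 2.0, 0.7]

print(perplexity(ckpt_early), perplexity(ckpt_late))
print(loss_change_per_token(ckpt_early, ckpt_late))
```

In this toy data, two tokens improve, one stagnates, and one gets slightly worse, which is the kind of per-token breakdown the first finding refers to.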

Code & Models

Repositories

xiamengzhou/training_trajectory_analysis
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: OPT