Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

Tyler A. Chang; Zhuowen Tu; Benjamin K. Bergen

arXiv:2308.15419·cs.CL·August 1, 2024·1 cites

TL;DR

This paper analyzes how language models learn during pre-training, revealing patterns of learning, forgetting, and stability across tokens and contexts, highlighting the progression from n-gram learning to refined predictions.

Contribution

It provides a detailed empirical characterization of learning dynamics, including fluctuations and stability, during language model pre-training across multiple runs.

Findings

01 Frequent tokens are learned earlier and more stably.

02 Learning involves early n-gram acquisition followed by gradual refinement.

03 Token loss fluctuations are consistent across pre-training runs.

Abstract

How do language models learn to make predictions during pre-training? To study this, we extract learning curves from five autoregressive English language model pre-training runs, for 1M unseen tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We also find that individual tokens often exhibit sudden increases or decreases in loss that are surprisingly consistent across pre-training runs. To better understand these fluctuations, we quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training.…
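The quantities named in the abstract are all summaries of a per-token learning curve, that is, a token's surprisal recorded at successive pre-training checkpoints. The sketch below is a toy illustration of that idea only. Surprisal as the negative log-probability is standard, but the function names, the `tail` parameter, and the specific formulas for variability and the "rise" statistic are assumptions made for this example and are not the paper's definitions.

```python
import math
from statistics import mean, pstdev


def surprisal_bits(prob):
    """Surprisal (in bits) of a token given the probability the model assigned to it."""
    return -math.log2(prob)


def summarize_curve(surprisals, tail=2):
    """Toy summaries of one token's learning curve from a single run.

    All three statistics are illustrative stand-ins (assumptions), not the
    metrics defined in the paper.
    """
    # Average surprisal over the last `tail` checkpoints.
    final = mean(surprisals[-tail:])
    # Spread of the changes between consecutive checkpoints.
    deltas = [b - a for a, b in zip(surprisals, surprisals[1:])]
    within_run = pstdev(deltas)
    # Largest rise above the lowest surprisal reached so far,
    # a crude proxy for a token being "forgotten".
    best = surprisals[0]
    max_rise = 0.0
    for s in surprisals[1:]:
        max_rise = max(max_rise, s - best)
        best = min(best, s)
    return {
        "final": final,
        "within_run_variability": within_run,
        "max_rise": max_rise,
    }


def cross_run_variability(curves):
    """Mean over checkpoints of the standard deviation across runs (assumed form)."""
    return mean([pstdev(values) for values in zip(*curves)])
```

For instance, a token assigned probabilities 0.125, 0.5, 0.25, 0.5, 0.5 across five checkpoints has the surprisal curve 3, 1, 2, 1, 1 bits; `summarize_curve` on that curve reports a final value of 1 bit and a maximum rise of 1 bit, reflecting the temporary increase at the third checkpoint.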

Code & Models

Repositories

tylerachang/lm-learning-curves (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification