Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng

TL;DR
This paper investigates the learning dynamics of continual pre-training (CPT) in large language models, deriving a scaling law that predicts how loss evolves across training steps and learning rate schedules and that can guide the choice of training hyper-parameters.
Contribution
The paper introduces a CPT scaling law that combines the effects of distribution shift and learning rate annealing, enabling loss prediction at any continual training step and across learning rate schedules, and supporting the customization of training hyper-parameters.
Findings
The CPT loss curve is a transition from one loss curve to another, hidden curve, and can be described by decoupling the effects of distribution shift and learning rate annealing.
The derived scaling law accurately predicts loss across different datasets and hyper-parameters.
The approach allows for hyper-parameter customization to balance general and domain-specific performance.
Abstract
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training steps and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis
Methods: Focus
