Scaling and Transferability of Annealing Strategies in Large Language Model Training
Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang, Xunliang Cai, Jingang Wang, Xiaomeng Li

TL;DR
This paper explores how annealing strategies in large language model training can be transferred and optimized using a generalized framework, reducing the need for exhaustive hyperparameter tuning.
Contribution
It introduces an improved predictive framework for annealing strategies that accounts for training steps, maximum learning rate, and annealing behavior, enabling better transferability.
Findings
Smaller models can reliably proxy larger model training dynamics.
Optimal annealing ratios follow consistent patterns across models.
Transferable annealing strategies improve training efficiency.
Abstract
Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments using both Dense and…
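For readers unfamiliar with the WSD scheduler named in the abstract, the following is a minimal sketch of one common form of it: linear warmup to the maximum learning rate, a constant plateau, then a decay over the final fraction of training. The function name `wsd_lr`, its parameters, and the linear decay shape are assumptions made for illustration; the abstract does not specify the decay function or the parameterization the authors use.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps, anneal_ratio, min_lr=0.0):
    """Learning rate at a 0-indexed `step` under a Warmup-Steady-Decay schedule.

    Hypothetical parameterization (not taken from the paper):
      anneal_ratio -- fraction of total_steps spent in the final decay phase
      min_lr       -- value the decay phase heads toward
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= anneal_ratio <= 1.0:
        raise ValueError("anneal_ratio must be in [0, 1]")

    decay_steps = int(round(anneal_ratio * total_steps))
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases overlap")
    decay_start = total_steps - decay_steps

    # Phase 1: linear warmup, reaching max_lr on the last warmup step.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phase 2: steady plateau at the maximum learning rate.
    if step < decay_start:
        return max_lr
    # Phase 3: linear decay from max_lr toward min_lr (assumed shape).
    progress = (step - decay_start) / decay_steps
    return max_lr - (max_lr - min_lr) * progress


if __name__ == "__main__":
    # Illustrative settings only; these values are not from the paper.
    for s in (0, 99, 500, 800, 900, 999):
        print(s, wsd_lr(s, total_steps=1000, max_lr=1e-3,
                        warmup_steps=100, anneal_ratio=0.2))
```

In this sketch the annealing ratio and the maximum learning rate are the two knobs one would sweep on a small proxy model before reusing the chosen values at larger scale, which is the kind of transfer the abstract describes.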
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
