When is Warmstarting Effective for Scaling Language Models?
Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Josif Grabocka, Frank Hutter, Aaron Klein

TL;DR
This paper investigates when warmstarting is effective for large-scale language model training. It finds that simple growth strategies can outperform more complex warmstarting operators, and that beyond a certain growth factor, training from scratch is more efficient.
Contribution
The paper shows that preserving the base model's performance at initialization is not necessary for strong final performance, introduces architecture-agnostic growth strategies, and empirically identifies the growth factor that gives the best training efficiency.
Findings
A 2x growth factor yields the best speedups in most scenarios.
Training from scratch is more efficient beyond a certain growth factor.
Simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators.
Abstract
Model growth from a given checkpoint aims to accelerate training of a larger model, offering potential resource savings. Despite recent interest, warmstarting has seen limited practical adoption in large-scale training. We attribute this to two underexplored factors: (1) an overemphasis on preserving the smaller model's performance at initialization, which constrains operator design for new architectures, and (2) insufficient analysis of how growth interacts with hyperparameters and scaling behavior, compounded by inconsistent growth factors across the literature. We show that preserving the base model's initial post-growth performance is not necessary for strong final performance, and that simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators. Crucially, we empirically identify an upper bound on the growth factor beyond which training…
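To make the idea of growth without preserving initial performance concrete, here is a minimal sketch of one generic way to grow a single weight matrix: copy the smaller matrix into the top-left block of a larger, freshly initialized one. This is an illustrative assumption, not the paper's operator. The function name `grow_weight`, the `init_std` value, and the toy shapes are all hypothetical. This summary does not say how the paper measures the growth factor, so the snippet reports both the width ratio and the parameter ratio.

```python
import numpy as np


def grow_weight(small, new_shape, init_std=0.02, seed=0):
    """Embed `small` in the top-left block of a larger random matrix.

    Hypothetical helper for illustration only. The new entries are random,
    so the grown layer does not reproduce the small layer's outputs.
    """
    rows, cols = small.shape
    if new_shape[0] < rows or new_shape[1] < cols:
        raise ValueError("new_shape must be at least as large as small.shape")
    rng = np.random.default_rng(seed)
    # Fresh random initialization for the whole larger matrix.
    large = rng.normal(0.0, init_std, size=new_shape).astype(small.dtype)
    # Overwrite the top-left block with the trained smaller weights.
    large[:rows, :cols] = small
    return large


# Toy example: double the width of a 4x4 layer.
base = np.ones((4, 4), dtype=np.float32)
grown = grow_weight(base, (8, 8))

width_factor = grown.shape[0] / base.shape[0]  # ratio of layer widths
param_factor = grown.size / base.size          # ratio of parameter counts
print(width_factor, param_factor)
```

Doubling the width of a square layer quadruples its parameter count, so the unit in which a growth factor is stated matters when comparing results across papers.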
