Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi

TL;DR
This paper explores alternative training schedules to cosine decay, demonstrating predictable scaling behavior and the benefits of stochastic weight averaging, enabling more efficient scaling experiments with reduced compute resources.
Contribution
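The paper studies a simple alternative to the cosine schedule, a constant learning rate followed by a cooldown, and shows that it scales predictably. It also shows that scaling experiments can be run with fewer training runs and less compute, improving understanding of model scaling beyond fixed training durations.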
Findings
Constant learning rate with cooldowns scales predictably.
Stochastic weight averaging improves performance during training.
Scaling experiments can be done with fewer runs and less compute.
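A minimal sketch of the schedule named in the first finding, a constant learning rate followed by a cooldown. The function name, the default values, the linear warmup, and the linear shape of the cooldown are illustrative assumptions, not details taken from this summary.

```python
def lr_constant_with_cooldown(step, total_steps, peak_lr=1e-3,
                              warmup_steps=100, cooldown_frac=0.2):
    """Learning rate at `step` for a constant schedule with a final cooldown.

    Assumptions (hypothetical, for illustration only): linear warmup,
    linear cooldown to zero, and the default hyperparameter values.
    """
    cooldown_steps = round(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Constant phase: the rate does not depend on total_steps here.
        return peak_lr
    # Cooldown phase: decay linearly to zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / cooldown_steps


# Two training lengths share the same rate during the constant phase,
# so a shorter run can be seen as a prefix of a longer one plus a cooldown.
short_run = [lr_constant_with_cooldown(s, 1000) for s in range(1000)]
long_run = [lr_constant_with_cooldown(s, 2000) for s in range(2000)]
```

Because the constant phase is independent of the total length, the two runs above agree on every step before the shorter run starts its cooldown, which is the property a cosine schedule lacks.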
Abstract
Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative -- constant learning rate and cooldowns -- and find that it scales predictably and reliably similar to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly…
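The weight averaging mentioned in the abstract can be illustrated with a uniform running mean over weight snapshots. This is a simplified sketch, not the paper's exact procedure: the class name `RunningWeightAverage`, the flat list-of-floats representation of the weights, and the choice of a uniform average over all snapshots are assumptions made for illustration.

```python
class RunningWeightAverage:
    """Uniform running average of weight snapshots (hypothetical helper)."""

    def __init__(self):
        self.count = 0   # number of snapshots averaged so far
        self.avg = None  # current averaged weights

    def update(self, weights):
        """Fold one snapshot into the average and return the new average."""
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n, no stored history needed.
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]
        return self.avg


averager = RunningWeightAverage()
for snapshot in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    averaged = averager.update(snapshot)
```

The average is maintained alongside training and only read at evaluation time, so it needs no extra gradient steps.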
Taxonomy
Topics: Scheduling and Timetabling Solutions · Simulation Techniques and Applications
Methods: Stochastic Weight Averaging
