Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun

TL;DR
This paper demonstrates that Schedule-Free (SF) methods, particularly SF-AdamW, effectively train large language models by implicitly performing weight averaging without decay phases or extra memory, offering a scalable and theoretically grounded alternative to traditional schedules.
Contribution
The paper introduces a refined SF-AdamW variant with improved robustness and scalability, and provides both empirical and theoretical insights into the dynamics of SF methods in language model training.
Findings
SF-AdamW follows the "river" structure of the loss landscape without a learning rate decay phase.
It implicitly performs weight averaging without additional memory.
The refined variant is more robust than the original SF method and performs better in large-batch training.
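The implicit averaging in the findings above can be illustrated with the basic Schedule-Free recursion from Defazio et al. [2024]. The sketch below is not the paper's SF-AdamW or its refined variant: it swaps in plain SGD as the base optimizer, and the function name, the toy quadratic objective, and the hyperparameter values are all assumptions chosen for illustration. It keeps only two parameter copies (`z` and `x`); the list of past iterates exists solely to check that `x` equals their uniform average.

```python
def schedule_free_sgd(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Schedule-Free recursion with a plain SGD base step (illustrative only).

    z: base iterate that takes the gradient steps
    x: averaged iterate used for evaluation
    y: interpolation of z and x where the gradient is evaluated
    """
    z = w0
    x = w0
    z_history = [z]  # kept only to verify the averaging identity
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)            # constant learning rate, no decay phase
        c = 1.0 / (t + 1)               # uniform averaging weight
        x = (1 - c) * x + c * z         # online average, no stored checkpoints
        z_history.append(z)
    return x, z_history


# Toy objective (assumed): f(w) = 0.5 * (w - 3)^2, so grad f(w) = w - 3.
x_final, zs = schedule_free_sgd(lambda w: w - 3.0, w0=0.0)
explicit_average = sum(zs) / len(zs)
print(abs(x_final - explicit_average) < 1e-9)
```

With the weight `c = 1 / (t + 1)`, the online update reproduces the running mean of all base iterates, which is why no separate buffer of past weights is needed.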
Abstract
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this…
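The contrast between a fixed-budget cosine schedule and a WSD schedule can be sketched as follows. This is a minimal illustration, not the configuration used in the paper: the function names, the linear warmup, the linear decay shape, and all parameter names are assumptions. The point is that the cosine schedule needs the total step count up front, whereas the stable phase of WSD does not depend on when training ends.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine schedule: requires total_steps to be fixed in advance."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup-stable-decay: the stable phase is independent of the end point."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                              # stable phase
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)      # linear decay (assumed)


# Hypothetical settings: 100 warmup steps, decay over the last 200 of 1000 steps.
for s in (0, 500, 900, 1000):
    print(s, cosine_lr(s, 1000, 1.0, warmup_steps=100), wsd_lr(s, 1.0, 100, 800, 200))
```

Extending a WSD run only requires choosing a later `decay_start`, but an explicit decay phase is still needed before each evaluation point, which is the limitation the abstract attributes to WSD.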
Taxonomy
Topics: Natural Language Processing Techniques
