Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Minhak Song; Beomhan Baek; Kwangjun Ahn; Chulhee Yun

arXiv:2507.09846 · cs.LG · November 4, 2025

TL;DR

This paper demonstrates that Schedule-Free (SF) methods, particularly SF-AdamW, effectively train large language models by implicitly performing weight averaging without decay phases or extra memory, offering a scalable and theoretically grounded alternative to traditional schedules.

Contribution

The paper introduces a refined SF-AdamW variant that improves robustness and scalability, providing both empirical and theoretical insights into its dynamics for language model training.

Findings

01. SF-AdamW navigates the loss landscape without decay phases.

02. It implicitly performs weight averaging without additional memory.

03. The refined variant outperforms the original SF method in robustness and in large-batch training.
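
The implicit averaging in the second finding can be made concrete with a small sketch. The Python below illustrates the base Schedule-Free SGD update of Defazio et al. [2024]; it is not the SF-AdamW variant studied in the paper and not the authors' code. The function name `sf_sgd`, the toy quadratic objective, and the hyperparameter values are assumptions chosen for illustration, and warmup is omitted.

```python
import numpy as np

def sf_sgd(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Schedule-Free SGD sketch (illustrative; names and defaults are assumptions).

    Returns the averaged iterate x and the list of all base iterates z.
    """
    z = np.array(x0, dtype=float)  # base SGD sequence
    x = z.copy()                   # running average, used for evaluation
    zs = [z.copy()]                # kept only to check the averaging identity
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # gradient is taken at an interpolation of z and x
        z = z - lr * grad(y)             # plain SGD step on z with a constant learning rate
        c = 1.0 / (t + 1)                # uniform-averaging weight
        x = (1.0 - c) * x + c * z        # online average: no history of iterates is needed
        zs.append(z.copy())
    return x, zs

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w (hypothetical example).
x_avg, z_hist = sf_sgd(lambda w: w, x0=[1.0, -2.0])
```

With the weight c = 1/(t+1), `x_avg` equals the uniform mean of the entries of `z_hist`. The list is stored here only to verify that identity; the update itself keeps just `z` and `x`, and no learning rate decay appears anywhere in the loop.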

Abstract

As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this…
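
The contrast the abstract draws between fixed-budget schedules and WSD can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the linear warmup, and the linear decay shape in `wsd_lr` are assumptions (other decay shapes are also used in practice).

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine schedule: total_steps must be fixed before training starts."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup-stable-decay: the stable phase does not depend on a total budget."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                              # stable phase at constant rate
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)      # linear decay (assumed shape)
```

In `cosine_lr` every step depends on `total_steps`, so extending a run changes the whole schedule. In `wsd_lr` the stable phase can be extended freely, but a separate decay phase starting at `decay_start` is still needed. That explicit decay phase is what the abstract says SF-AdamW avoids.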

Videos

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training (SlidesLive)

Taxonomy

Topics: Natural Language Processing Techniques