WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou

TL;DR
The paper introduces WSM, a unified framework that replaces traditional learning rate decay with checkpoint merging, leading to improved performance in large language model pre-training and fine-tuning.
Contribution
WSM gives a theoretical foundation that links learning rate decay strategies to model merging, identifies merge duration as the key factor, and outperforms existing decay methods.
Findings
Merge duration, the training window over which checkpoints are aggregated, is the most critical factor for model performance.
WSM outperforms traditional decay approaches on multiple benchmarks.
Performance gains are consistent in fine-tuning scenarios.
Abstract
Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies, including cosine decay, linear decay and inverse square root decay, as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint…
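The link between decay and merging described in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes plain SGD and, importantly, that the same gradient sequence is seen with and without decay. Under those assumptions, decaying the learning rate by multipliers w_1..w_k over k steps gives the same weights as averaging the constant-LR checkpoints with coefficients c_j = w_j - w_{j+1} (taking w_0 = 1 and w_{k+1} = 0). The function name `merge_weights_from_lr_multipliers` and all values below are hypothetical.

```python
import random


def merge_weights_from_lr_multipliers(lr_mult):
    """Map LR multipliers w_1..w_k to checkpoint weights c_0..c_k.

    Hypothetical helper: c_j = w_j - w_{j+1}, with w_0 = 1 and w_{k+1} = 0,
    so the weights sum to 1. They are non-negative when w is non-increasing.
    """
    padded = [1.0] + list(lr_mult) + [0.0]
    return [padded[j] - padded[j + 1] for j in range(len(padded) - 1)]


random.seed(0)
dim, k, base_lr = 4, 10, 0.1
theta_start = [random.gauss(0, 1) for _ in range(dim)]
# Assumption: one fixed gradient sequence shared by both paths.
grads = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

# Linear decay multipliers for steps 1..k (reaches 0 at step k).
lr_mult = [1.0 - i / k for i in range(1, k + 1)]

# Path A: SGD with a decaying learning rate.
theta_decay = list(theta_start)
for w, g in zip(lr_mult, grads):
    theta_decay = [t - base_lr * w * gi for t, gi in zip(theta_decay, g)]

# Path B: SGD with a constant learning rate, saving every checkpoint.
checkpoints = [list(theta_start)]
for g in grads:
    checkpoints.append([t - base_lr * gi for t, gi in zip(checkpoints[-1], g)])

# Merge the constant-LR checkpoints with the derived weights.
weights = merge_weights_from_lr_multipliers(lr_mult)
theta_merged = [
    sum(c * ckpt[d] for c, ckpt in zip(weights, checkpoints))
    for d in range(dim)
]

max_gap = max(abs(a - b) for a, b in zip(theta_decay, theta_merged))
print(f"largest coordinate gap between decayed and merged weights: {max_gap:.2e}")
```

With linear multipliers the derived weights are uniform over the first k checkpoints, so in this toy setting linear decay corresponds to a plain average. In real training the gradients depend on the iterates, so the two paths would not match exactly.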
Peer Reviews
Decision·ICLR 2026 Oral
1. The experiment in this paper is large-scale and detailed. The models are evaluated over a range of hard benchmarks to show the improvement of the methods.
2. The proposed method is conceptually clean and can be useful in many settings.
3. The theoretical justification of the method, while not rigorous, provides intuitions in guiding the design of the algorithm.
1. In Table 5, the authors show that WSM has a higher ending loss compared to WSD. This seems to be counterintuitive, especially given the higher downstream accuracy. This also brings questions regarding the validity of the correspondence discussed in Section 3.1.
2. Regarding continual pretraining, the authors propose to continual pretrain from the constant learning rate checkpoints before weight merging. It is unclear how this will compare with re-warming up the decay checkpoint in a standard
Overall I really like this paper! It's well written and argued, and the empirical results are quite strong. In terms of significance, this has a non-trivial chance of becoming the standard way people produce final model checkpoints. The theoretical derivation of the equivalence is simple in a wonderful way (though I worry a bit specious). It's of course just algebra, but I have no problem with that. The intuition is the same one that motivates EMA and SWA etc. The empirical results are very s
I have two categories of concerns:
### Experiments don't really demonstrate connection between theory and practice.
My biggest concern is that, algebra-aside, the theory is a bit specious. The derivation assumes that gradients would be ~the same in a decay versus non-decay setup, and that can't be true. So, really I think this is "decay-inspired averaging" or "empirical substitute" more than an equivalent/replacement. To that end, there aren't any experiments directly comparing the supposedly
* WSM is a simple method which provides sufficient improvement over WSD and is convenient enough to implement and use that I support the paper's proposal to adopt it instead of WSD.
* The experiments are comprehensive.
* This work has limited novelty. The connection between LR schedules and weight averaging is well-known and has more complete treatment elsewhere [1,2]. This includes almost the entirety of Section 3 and forms the main contribution of the paper. Further, the theoretical treatment in this paper is rather hand-wavy.
* The proposed methodology is also not surprising, although maybe it hasn't appeared in this packaging before. Weight averaging is a common practice, often applied with annealing [3].
Taxonomy
Topics: VLSI and Analog Circuit Testing · Analog and Mixed-Signal Circuit Design · Distributed systems and fault tolerance
