Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki

TL;DR
Pre-training large language models with a constant learning rate after warmup (WSO) improves their adaptability and performance in downstream tasks compared to decay-based learning rate schedulers.
Contribution
This study demonstrates that avoiding learning rate decay during pre-training leads to models with flatter minima and better downstream performance, challenging conventional training practices.
Findings
WSO outperforms decay-based schedulers after supervised fine-tuning
Models trained with WSO have flatter loss landscape minima
Avoiding LR decay enhances downstream task adaptability
Abstract
We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based…
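The scheduler contrast described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the linear warmup shape, the function names (`warmup_factor`, `wso_lr`, `cosine_lr`), and the `min_lr` parameter are assumptions made here for the example. The abstract states only that WSO holds the learning rate constant after warmup, with no decay.

```python
import math


def warmup_factor(step, warmup_steps):
    """Linear warmup multiplier rising from 0 to 1 (assumed warmup shape)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: warm up, then hold peak_lr with no decay phase."""
    return peak_lr * warmup_factor(step, warmup_steps)


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Decay-based baseline: warm up, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * warmup_factor(step, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At the final pre-training step, `cosine_lr` has decayed to `min_lr`, whereas `wso_lr` still returns `peak_lr`. That end-of-pre-training difference is the one the paper relates to performance after SFT.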
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper examines the common practice of using learning-rate decay in LLM pre-training. The paper provides empirical evidence that keeping a constant learning rate after warmup improves performance. This approach, called the Warmup-Stable-Only (WSO) scheduler, outperforms conventional decay-based schedulers in supervised fine-tuning. Their finding highlights practical effectiveness for optimizing the entire LLM training pipeline. 2. The paper demonstrates the inversion effect between pre-tra
1. The experiments are restricted to 1B and 8B parameters, which are relatively small compared to state-of-the-art deployed LLMs (often 30B to 70B+). The absence of results at larger scales limits confidence in whether the observed advantages of WSO would extend to all situations. 2. The study evaluates WSO against only three decay-based schedulers (Cosine, Linear, and Warmup-Stable-Decay). Other commonly used or recently explored learning rate strategies, such as polynomial decay or cyclic policie
1. This paper presents WSO, a very simple and intuitive LR schedule that improves downstream performance after SFT. 2. This paper provides a practical reference for the LLM pre-training community to design LR schedules from a global training perspective.
The primary concern with this paper is that the proposed approach, while effective, has been extensively discussed, implemented, and validated in prior work, without introducing significant novelty. Furthermore, the absence of references to these existing studies raises questions about the thoroughness of the literature review. 1. In the original WSD paper (https://arxiv.org/pdf/2404.06395), the authors already demonstrated the benefits of switching to high-qua
The paper is exceptionally clear, well-written, and easy to follow. The central conclusion—that pre-training without LR decay enhances SFT performance—is simple, impactful, and supported by extensive evidence. The experiments are comprehensive, covering multiple model scales (1B and 8B), different training pipelines (two-stage and three-stage with mid-training), and modern training regimes (over-training). This work has significant practical implications for the industry. The WSO scheduler is
The primary weakness, though minor, is that the investigation of downstream performance is limited to SFT. The paper does not explore other critical post-training stages, such as preference tuning (e.g., DPO) or reinforcement learning-based alignment. It remains an open question whether the significant benefits of WSO pre-training persist or behave differently in these other alignment scenarios. I don't think this would be an issue as the title also constrains the scope to SFT.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Machine Learning and Data Classification
