Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Kazuki Yano; Shun Kiyono; Sosuke Kobayashi; Sho Takase; Jun Suzuki

arXiv:2603.16127 · cs.CL · March 18, 2026

Open Access · 3 Reviews

TL;DR

Pre-training large language models with Warmup-Stable-Only (WSO), a schedule that holds the learning rate constant after warmup, yields better performance after supervised fine-tuning than decay-based learning rate schedulers.

Contribution

This study shows that avoiding learning rate decay during pre-training leads to models with flatter minima and better downstream performance after supervised fine-tuning, challenging the conventional practice of decaying the learning rate.

Findings

1. WSO outperforms decay-based schedulers after supervised fine-tuning
2. Models trained with WSO have flatter loss landscape minima
3. Avoiding LR decay enhances downstream task adaptability

Abstract

We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based…
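The difference between WSO and the decay-based schedulers named in the reviews (Cosine, Linear, and Warmup-Stable-Decay) can be sketched as a function from training step to learning rate. This is a minimal illustration, not the authors' implementation: the function name `lr_at`, the linear warmup, the linear decay phase used for WSD, and the parameters `min_lr` and `decay_frac` are all assumptions made for the example, and the paper's actual hyperparameters are not stated on this page.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr,
          schedule="wso", min_lr=0.0, decay_frac=0.1):
    """Return the learning rate at a 0-indexed training step.

    Hypothetical helper for illustration only. `min_lr` and `decay_frac`
    are assumed knobs, not values taken from the paper.
    """
    # Linear warmup from peak_lr / warmup_steps up to peak_lr (assumed shape).
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # WSO: hold the peak learning rate for the rest of training, no decay.
    if schedule == "wso":
        return peak_lr

    # WSD: hold the peak, then decay over the final `decay_frac` of training.
    # The decay is linear here; other decay shapes are possible.
    if schedule == "wsd":
        decay_start = total_steps - int(decay_frac * total_steps)
        if step < decay_start:
            return peak_lr
        progress = (step - decay_start) / max(1, total_steps - decay_start)
        return peak_lr + (min_lr - peak_lr) * progress

    # Cosine and linear decay begin right after warmup.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "linear":
        return peak_lr + (min_lr - peak_lr) * progress
    if schedule == "cosine":
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_lr + (peak_lr - min_lr) * cosine

    raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    # Compare the learning rate at a few points in a toy 1000-step run.
    for name in ("wso", "wsd", "linear", "cosine"):
        points = [lr_at(s, 1000, 100, 1e-3, name) for s in (0, 500, 999)]
        print(name, ["%.2e" % p for p in points])
```

Under these assumptions, only the WSO branch still returns `peak_lr` at the last step; every decay-based branch ends near `min_lr`. That final-step difference is the property the abstract contrasts.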

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper examines the common practice of using learning-rate decay in LLM pre-training. The paper provides empirical evidence that keeping a constant learning rate after warmup improves performance. This approach, called the Warmup-Stable-Only (WSO) scheduler, outperforms conventional decay-based schedulers in supervised fine-tuning. Their finding highlights practical effectiveness for optimizing the entire LLM training pipeline.
2. The paper demonstrates the inversion effect between pre-tra

Weaknesses

1. The experiments are restricted to 1B and 8B parameters, which are relatively small compared to state-of-the-art deployed LLMs (often 30B to 70B+). The absence of results at larger scales limits confidence in whether the observed advantages of WSO would extend to all situations.
2. The study evaluates WSO against only three decay-based schedulers (Cosine, Linear, and Warmup-Stable-Decay). Other commonly used or recently explored learning rate strategies, such as polynomial decay or cyclic policie

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. This paper presents WSO, a very simple and intuitive LR schedule that improves downstream performance after SFT.
2. This paper provides a practical reference for the LLM pre-training community to design LR schedules from a global training perspective.

Weaknesses

The primary concern with this paper is that the proposed approach, while effective, has been extensively discussed, implemented, and validated in prior work, without introducing significant novelty. Furthermore, the absence of references to these existing studies raises questions about the thoroughness of the literature review.
1. In the original WSD paper (https://arxiv.org/pdf/2404.06395), the authors already demonstrated the benefits of switching to high-qua

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is exceptionally clear, well-written, and easy to follow. The central conclusion—that pre-training without LR decay enhances SFT performance—is simple, impactful, and supported by extensive evidence. The experiments are comprehensive, covering multiple model scales (1B and 8B), different training pipelines (two-stage and three-stage with mid-training), and modern training regimes (over-training). This work has significant practical implications for the industry. The WSO scheduler is

Weaknesses

The primary weakness, though minor, is that the investigation of downstream performance is limited to SFT. The paper does not explore other critical post-training stages, such as preference tuning (e.g., DPO) or reinforcement learning-based alignment. It remains an open question whether the significant benefits of WSO pre-training persist or behave differently in these other alignment scenarios. I don't think this would be an issue as the title also constrains the scope to SFT.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Machine Learning and Data Classification