How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo; Zhenbo Sun; Haodong Wen; Xinyu Shi; Jiarui Cui; Chenyi Dang; Kaifeng Lyu; Wenguang Chen

arXiv:2511.18903 · cs.LG · May 15, 2026


TL;DR

This paper investigates how learning rate decay schedules interfere with curriculum-based pretraining of large language models, and proposes strategies to improve their synergy, leading to better benchmark performance.

Contribution

The paper identifies an incompatibility between ascending data-quality ordering and learning rate decay in curriculum pretraining, and offers simple mitigation strategies that improve performance.

Findings

01 Under a constant LR, curriculum pretraining substantially outperforms random shuffling.

02 Standard LR decay schedules diminish the benefits of curriculum pretraining.

03 Moderate LR decay and checkpoint averaging improve benchmark scores.
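As a rough illustration of the checkpoint averaging mentioned in the third finding, the sketch below uniformly averages the parameters of several saved checkpoints. This summary does not say which checkpoints the paper averages or how it weights them, so the uniform weighting, the dict-of-lists parameter format, and the name `average_checkpoints` are all assumptions made for illustration.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameter dicts that share the same keys and lengths.

    Hypothetical helper: each checkpoint is a dict mapping a parameter name
    to a flat list of floats. Uniform weights are an assumption, not
    necessarily the scheme used in the paper.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for key in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        averaged[key] = [sum(vals) / len(checkpoints) for vals in columns]
    return averaged


# Toy checkpoints standing in for the last few saves of a training run.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
avg = average_checkpoints(ckpts)
print(avg)
```

Averaging weights from late checkpoints is often described as a way to reduce parameter noise without shrinking the step size, which may be why it pairs naturally with a milder decay schedule here.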

Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two…
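One way to picture the incompatibility described in the abstract is the toy calculation below. Data is arranged in ascending quality, and each step's quality score is weighted by the learning rate applied at that step. The LR-weighted average is only a crude proxy chosen for this illustration, not a metric from the paper. The linear quality scores, the cosine schedule decaying to zero, and the function names are likewise assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_lr):
    """Cosine decay from peak_lr at step 0 to final_lr at the last step."""
    progress = step / max(total_steps - 1, 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


def lr_weighted_quality(qualities, lrs):
    """Average data quality, weighting each step by its learning rate."""
    return sum(q * lr for q, lr in zip(qualities, lrs)) / sum(lrs)


total_steps = 1000
# Ascending curriculum: hypothetical quality scores rising linearly from 0 to 1.
qualities = [i / (total_steps - 1) for i in range(total_steps)]

constant = [1e-3] * total_steps
decayed = [cosine_lr(s, total_steps, 1e-3, 0.0) for s in range(total_steps)]

const_q = lr_weighted_quality(qualities, constant)
decay_q = lr_weighted_quality(qualities, decayed)
print(f"constant LR: {const_q:.3f}, cosine decay: {decay_q:.3f}")
```

In this toy setup the decayed schedule places most of its update weight on the early, lower-quality portion of the curriculum, so the highest-quality data at the end is seen with the smallest steps.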

Code & Models

Models

thu-pacman/PCMind-2.1-Kaiyuan-2B (Hugging Face model; 14 downloads, 29 likes)

Videos

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining (SlidesLive)