TL;DR
This paper investigates how learning rate decay schedules interfere with curriculum-based pretraining of large language models, and proposes strategies to improve their synergy, leading to better benchmark performance.
Contribution
It identifies the incompatibility between data quality ordering and learning rate decay in curriculum pretraining, and offers simple mitigation strategies to enhance performance.
Findings
Under a constant LR, curriculum pretraining substantially outperforms random shuffling.
Standard LR decay diminishes the benefits of curriculum pretraining.
Moderate LR decay and checkpoint averaging improve benchmark scores.
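The two mitigations named in the findings above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine shape, the `final_ratio` parameter (how far the LR is allowed to fall), and uniform averaging over checkpoints are all assumptions made here, and the function names are hypothetical.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_ratio):
    """Cosine decay from peak_lr down to final_ratio * peak_lr.

    final_ratio is an assumed knob: a value near 0 gives an aggressive
    decay, while a larger value (e.g. 0.5) gives a more moderate one.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints):
    """Uniformly average parameter dicts of the form {name: [float, ...]}.

    Uniform weighting is an assumption; other weightings are possible.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }
```

With `final_ratio=0.5` the LR ends at half its peak, so the high-quality data at the end of the curriculum is still seen with a sizeable step size. Averaging the last few checkpoints is one way to reduce the parameter noise that a small final LR would otherwise suppress.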
Abstract
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two…
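The data orderings compared in the abstract can be sketched as below. This is a hedged illustration only: `quality_fn` stands in for whatever quality metric is used, and the function names and the `(text, score)` document layout are hypothetical.

```python
import random


def curriculum_order(docs, quality_fn):
    """Return docs sorted in ascending order of quality (lowest first)."""
    return sorted(docs, key=quality_fn)


def shuffled_order(docs, seed=0):
    """Return a randomly shuffled copy of docs (the baseline ordering)."""
    rng = random.Random(seed)
    out = list(docs)
    rng.shuffle(out)
    return out


# Toy documents as (text, quality_score) pairs; the scores are made up.
docs = [("a", 0.9), ("b", 0.1), ("c", 0.5)]
ordered = curriculum_order(docs, quality_fn=lambda d: d[1])
```

In the curriculum ordering the highest-quality documents arrive last, which is exactly when a decaying schedule has driven the LR to its smallest values. That timing is the incompatibility the abstract describes.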
