Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
Mohamed Elgaar, Hadi Amiri

TL;DR
This paper investigates how linguistically motivated curriculum learning strategies influence the training dynamics of large language models. It finds that curricula mainly change how long training spends in each latent phase and how stable training is.
Contribution
It provides a detailed analysis of training phases, gradient noise, and output structure under various curricula, highlighting the impact of data ordering on training stability and phase durations.
Findings
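Curricula mainly change the time spent in each latent training phase, not the sequence of phases.
Random ordering yields higher gradient noise scale at 14M-70M parameters and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation.
In a reverse-order Verb Variation control, descending order loses much of the accuracy advantage of the ascending curriculum.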
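The orderings contrasted in the findings above (ascending, descending, Random) can be illustrated with a minimal sketch. It assumes each document has a scalar difficulty score. The function name `order_documents` and the scorer `toy_score` are hypothetical and stand in for the paper's Age-of-Acquisition, word frequency, and Verb Variation measures, whose exact definitions are not given in this summary.

```python
import random
from typing import Callable, List, Sequence


def order_documents(
    docs: Sequence[str],
    score: Callable[[str], float],
    mode: str = "ascending",
    seed: int = 0,
) -> List[str]:
    """Return docs in curriculum order (hypothetical helper, not the paper's code)."""
    if mode == "random":
        # Baseline: shuffle with a fixed seed so the order is reproducible.
        shuffled = list(docs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if mode not in ("ascending", "descending"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Ascending puts low-score documents first; descending reverses that.
    return sorted(docs, key=score, reverse=(mode == "descending"))


def toy_score(doc: str) -> float:
    # Assumed stand-in for a linguistic difficulty metric:
    # the number of distinct lowercase tokens in the document.
    return float(len(set(doc.lower().split())))


docs = [
    "the cat sat",
    "a dog",
    "the quick brown fox jumps over lazy dogs",
]
ascending = order_documents(docs, toy_score, "ascending")
descending = order_documents(docs, toy_score, "descending")
shuffled = order_documents(docs, toy_score, "random", seed=0)
```

Only the order of the data changes across the three modes. Every ordering contains the same documents, which is what lets the comparison isolate the effect of ordering.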
Abstract
Curriculum learning changes the order of pretraining data, but it remains unclear how ordering changes the learning dynamics. We pretrain models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula--Age-of-Acquisition, word frequency, and Verb Variation (VV)--and compare each against Random ordering. We analyze latent training phases, gradient noise scale (GNS), and the singular-value structure of the output head. We find that training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these…
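To make the "singular-value structure of the output head" concrete, here is a minimal sketch of one common way to summarize it: the Shannon entropy of the normalized singular values. This definition is an assumption and may differ from the paper's exact metric. The function name `singular_entropy` is hypothetical, and the example uses NumPy.

```python
import numpy as np


def singular_entropy(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the normalized singular values of a weight matrix.

    Assumed definition for illustration only, not taken from the paper.
    """
    # Singular values only; the singular vectors are not needed here.
    s = np.linalg.svd(weight, compute_uv=False)
    # Normalize so the singular values form a probability distribution.
    p = s / (s.sum() + eps)
    # Drop exact zeros to avoid log(0).
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


# An identity matrix spreads its spectrum evenly (entropy near log(4)),
# while a rank-one matrix concentrates it in one direction (entropy near 0).
flat_spectrum = singular_entropy(np.eye(4))
rank_one = singular_entropy(np.outer(np.ones(4), np.ones(4)))
```

Under this definition, tracking the value for the output-head matrix across checkpoints would show sudden changes in spectral concentration, such as the late spikes the abstract describes.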
