Scaling Law with Learning Rate Annealing
Howe Tissue, Venus Wang, Lu Wang

TL;DR
This paper introduces a scaling law model incorporating learning rate annealing to accurately predict neural language model loss curves, reducing computational costs and enhancing understanding of training dynamics.
Contribution
It presents a novel scaling law formulation that accounts for the learning rate schedule, enabling loss prediction at every training step after fitting on only one or two training curves.
Findings
The scaling law accurately fits loss curves across various models and hyperparameters.
After fitting on only one or two training curves, it can predict the loss at any training step under any learning rate scheduler.
The formulation extends to model size effects and explains empirical observations.
Abstract
We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: $L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$, where $L(s)$ is the validation loss at step $s$, $S_1$ is the area under the LR curve, $S_2$ is the LR annealing area, and $L_0$, $A$, $C$, $\alpha$ are constant parameters. This formulation takes into account two factors: (1) power-law scaling over data size, and (2) the additional loss reduction during LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step across any learning rate scheduler (LRS). This approach significantly reduces computational…
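To make the formula concrete, here is a minimal Python sketch that evaluates $L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$ along a learning rate schedule. Several pieces are assumptions, not taken from the abstract: $S_1$ is discretized as the running sum of the LR, $S_2$ is discretized as the running sum of a momentum-smoothed LR decrease with a hypothetical `decay` factor, the schedule is assumed to have no warmup, and the parameter values in the usage note are placeholders and not fitted results.

```python
import numpy as np

def annealing_areas(lrs, decay=0.999):
    """Return per-step (S1, S2) for a sequence of learning rates.

    Assumptions (not stated in the abstract):
    - S1 is the running sum of the LR (area under the LR curve).
    - S2 is the running sum of a momentum-smoothed LR decrease,
      with `decay` as a hypothetical smoothing factor.
    - The schedule has no warmup, so the LR never increases.
    """
    lrs = np.asarray(lrs, dtype=float)
    s1 = np.cumsum(lrs)

    # LR decrease at each step relative to the previous step (0 at step 0).
    drops = np.concatenate(([0.0], lrs[:-1] - lrs[1:]))

    # Exponentially smoothed accumulation of the LR decreases.
    momentum = np.zeros_like(lrs)
    m = 0.0
    for i, d in enumerate(drops):
        m = decay * m + d
        momentum[i] = m

    s2 = np.cumsum(momentum)
    return s1, s2

def predicted_loss(lrs, L0, A, C, alpha, decay=0.999):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2 at every step."""
    s1, s2 = annealing_areas(lrs, decay)
    return L0 + A * s1 ** (-alpha) - C * s2
```

For example, `predicted_loss(np.linspace(2e-4, 0.0, 1000), L0=2.0, A=0.5, C=0.6, alpha=0.5)` returns one predicted loss per step for a linearly decaying schedule. With a constant LR the annealing term is zero and the curve reduces to the plain power law in $S_1$.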
Taxonomy
Topics: Neural Networks and Applications
Methods: Chinchilla
