Scaling Law with Learning Rate Annealing

Howe Tissue; Venus Wang; Lu Wang

arXiv:2408.11029 · cs.CL · October 28, 2024

Open Access

TL;DR

This paper introduces a scaling law model incorporating learning rate annealing to accurately predict neural language model loss curves, reducing computational costs and enhancing understanding of training dynamics.

Contribution

It presents a novel scaling law formulation that accounts for learning rate schedules, enabling precise loss prediction across training steps with minimal data.

Findings

01. The scaling law accurately fits loss curves across various models and hyperparameters.

02. It can predict loss at any training step using limited training data.

03. The formulation extends to model size effects and explains empirical observations.

Abstract

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: $L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$, where $L(s)$ is the validation loss at step $s$, $S_1$ is the area under the LR curve, $S_2$ is the LR annealing area, and $L_0$, $A$, $C$, $\alpha$ are constant parameters. This formulation takes into account two factors: (1) power-law scaling over data size, and (2) the additional loss reduction during LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step across any learning rate scheduler (LRS). This approach significantly reduces computational…
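To make the formula concrete, the following is a minimal Python sketch of how $L(s)$ could be evaluated for a given LR schedule. It rests on assumptions that the abstract does not spell out: $S_1$ is taken as the cumulative sum of per-step learning rates, and $S_2$ is taken as a running sum of a momentum term that accumulates LR decrements with a decay factor `lam`. The function names, `lam`, and all numeric parameter values are hypothetical and chosen only for illustration; they are not fitted values from the paper. The schedules have no warmup phase, since this sketch does not model LR increases.

```python
import numpy as np


def lr_areas(lrs, lam=0.999):
    """Return (S1, S2) at every step for a per-step LR array.

    S1: cumulative sum of the LR (discrete area under the LR curve).
    S2: ASSUMED form of the annealing area: running sum of a momentum
        term that accumulates LR decrements with decay `lam`.
    """
    lrs = np.asarray(lrs, dtype=float)
    s1 = np.cumsum(lrs)
    # LR decrement at each step relative to the previous step (0 at step 0).
    drops = np.concatenate(([0.0], lrs[:-1] - lrs[1:]))
    momentum = np.zeros_like(lrs)
    m = 0.0
    for i, d in enumerate(drops):
        m = lam * m + d
        momentum[i] = m
    s2 = np.cumsum(momentum)
    return s1, s2


def predicted_loss(lrs, L0, A, C, alpha, lam=0.999):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2 at every step."""
    s1, s2 = lr_areas(lrs, lam)
    return L0 + A * s1 ** (-alpha) - C * s2


# Illustrative schedules: constant LR vs. cosine annealing to zero.
steps, peak = 1000, 2e-4
const_lrs = np.full(steps, peak)
t = np.arange(steps)
cos_lrs = peak * 0.5 * (1.0 + np.cos(np.pi * t / (steps - 1)))

# Arbitrary placeholder parameters, NOT values reported by the authors.
params = dict(L0=2.7, A=0.4, C=1.0, alpha=0.5)
loss_const = predicted_loss(const_lrs, **params)
loss_cos = predicted_loss(cos_lrs, **params)
print("final loss (constant):", loss_const[-1])
print("final loss (cosine):  ", loss_cos[-1])
```

Under these assumptions a constant schedule has $S_2 = 0$ at every step, so its curve reduces to the pure power-law term, while any schedule that lowers the LR picks up the additional $-C \cdot S_2$ reduction.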

Taxonomy

Topics: Neural Networks and Applications

Methods: Chinchilla