TL;DR
This paper derives scaling laws for hyperparameters like weight decay and batch size in large language model pre-training, enabling better prediction and optimization of training settings as models and datasets grow.
Contribution
It introduces precise power law formulas for scaling hyperparameters such as weight decay and batch size, improving hyperparameter tuning for large-scale language model training.
Findings
Optimal weight decay scales linearly with batch size when model size and dataset size are held fixed.
Optimal batch size and critical batch size scale as power laws in dataset size.
Scaling laws enable accurate prediction of hyperparameters before large-scale training.
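As a rough illustration of how a power-law finding turns into a prediction, the sketch below fits y = c * x^k by linear regression in log-log space and extrapolates to a larger x. The function names are hypothetical, and the data points are synthetic values generated from an arbitrary power law, not measurements from the paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = c * x**k by least squares in log-log space; return (c, k)."""
    # polyfit returns [slope, intercept] for a degree-1 fit
    k, log_c = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(log_c)), float(k)

def predict_power_law(c, k, x):
    """Evaluate the fitted law y = c * x**k at new x."""
    return c * np.asarray(x, dtype=float) ** k

# Synthetic, illustrative data only (NOT results from the paper):
# pretend small-scale sweeps gave these (dataset size, best batch size) pairs.
d_small = np.array([1e9, 4e9, 1.6e10, 6.4e10])
b_small = 2.0 * d_small ** 0.4

c, k = fit_power_law(d_small, b_small)
b_pred = predict_power_law(c, k, 1e12)  # extrapolate to a larger dataset
```

In practice the fit would be made on optima found in small-scale sweeps, and the quality of the extrapolation depends on how well the power law actually holds at the target scale.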
Abstract
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, τ = B/(ηλD), should remain constant across training settings, and we verify the implication that optimal λ scales linearly with B, for a fixed N and D. However, as N and D scale, we show optimal τ obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict λopt in advance of large-scale training. We also study scaling laws for optimal batch size Bopt (the B enabling lowest loss at a given N,D) and critical batch size Bcrit (the B beyond which further data parallelism becomes ineffective). In contrast to…
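The relationship between the timescale and weight decay can be sketched in a few lines. This is a minimal illustration, assuming the AdamW timescale takes the form τ = B/(ηλD), with B and D measured in the same units and η the learning rate. The function names, the coefficient and exponent arguments, and the numbers in the usage lines are all hypothetical placeholders, not values from the paper.

```python
def adamw_timescale(batch_size, lr, weight_decay, dataset_size):
    """Assumed form of the AdamW timescale: tau = B / (eta * lambda * D)."""
    return batch_size / (lr * weight_decay * dataset_size)

def weight_decay_for_timescale(tau, batch_size, lr, dataset_size):
    """Invert the assumed timescale formula to get lambda for a target tau."""
    return batch_size / (lr * tau * dataset_size)

def optimal_timescale(tokens_per_param, coeff, exponent):
    """Power law in D/N; coeff and exponent are placeholders to be fitted."""
    return coeff * tokens_per_param ** exponent

# Illustrative numbers only: B and D are both counted in sequences here.
lam_small = weight_decay_for_timescale(tau=0.2, batch_size=512, lr=1e-2, dataset_size=1e7)
lam_large = weight_decay_for_timescale(tau=0.2, batch_size=1024, lr=1e-2, dataset_size=1e7)
# With tau, lr, and D fixed, doubling B doubles the implied weight decay.
```

Under these assumptions, one would first predict the target τ from D/N with a fitted power law, then solve for λ at the chosen batch size and learning rate.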
Taxonomy
Methods: AdamW · Weight Decay
