Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

Shane Bergsma; Nolan Dey; Gurpreet Gosal; Gavia Gray; Daria Soboleva; Joel Hestness

arXiv:2505.13738 · cs.LG · November 25, 2025

TL;DR

This paper derives scaling laws for weight decay and batch size in large language model pre-training, so that these hyperparameters can be predicted in advance as model size and dataset size grow.

Contribution

The paper introduces power-law formulas for the optimal AdamW timescale (and hence the optimal weight decay) and for the optimal and critical batch sizes, allowing these hyperparameters to be set ahead of large-scale training runs.

Findings

01

Optimal weight decay scales linearly with batch size, for a fixed model size and dataset size.

02

Optimal batch size and critical batch size scale as power laws in dataset size.

03

Scaling laws enable accurate prediction of hyperparameters before large-scale training.
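As a rough illustration of how a power law of this kind can be fitted on small runs and then extrapolated, the sketch below performs a least-squares fit of y = c · x^m in log-log space. The helper names `fit_power_law` and `predict` are hypothetical, and the data points are synthetic placeholders generated from a known exponent; they are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**m in log-log space; returns (c, m)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent m.
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    m = sxy / sxx
    # Intercept in log space maps back to the multiplicative constant c.
    c = math.exp(mean_y - m * mean_x)
    return c, m


def predict(c, m, x):
    """Evaluate the fitted power law at a new x."""
    return c * x ** m


if __name__ == "__main__":
    # Synthetic (x, y) pairs from y = 2 * x**0.5; NOT measurements from the paper.
    xs = [1e9, 1e10, 1e11]
    ys = [2.0 * x ** 0.5 for x in xs]
    c, m = fit_power_law(xs, ys)
    print(f"fitted exponent m = {m:.3f}, constant c = {c:.3f}")
    print(f"extrapolated value at x = 1e12: {predict(c, m, 1e12):.3e}")
```

In practice x would be a quantity such as dataset size, and y the tuned hyperparameter measured at each small-scale setting; real measurements are noisy, so the fit quality should be checked before extrapolating.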

Abstract

Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $η$ and weight decay $λ$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size $N$, dataset size $D$, and batch size $B$. Recent work suggests the AdamW timescale, $τ = B / (η λ D)$, should remain constant across training settings, and we verify the implication that optimal $λ$ scales linearly with $B$, for a fixed $N$ and $D$. However, as $N$ and $D$ scale, we show optimal $τ$ obeys a precise power law in the tokens-per-parameter ratio, $D/N$. This law thus provides a method to accurately predict $λ_{\text{opt}}$ in advance of large-scale training. We also study scaling laws for optimal batch size $B_{\text{opt}}$ (the $B$ enabling lowest loss at a given $N, D$) and critical batch size $B_{\text{crit}}$ (the $B$ beyond which further data parallelism becomes ineffective). In contrast to…
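The timescale definition in the abstract can be inverted to obtain the weight decay for a target timescale, which also makes the linear dependence on batch size explicit. The sketch below assumes batch size and dataset size are measured in the same units (for example, tokens); the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def adamw_timescale(batch_size, lr, weight_decay, dataset_size):
    """AdamW timescale as defined in the abstract: tau = B / (eta * lambda * D)."""
    return batch_size / (lr * weight_decay * dataset_size)


def weight_decay_for_timescale(tau, batch_size, lr, dataset_size):
    """Invert the definition: lambda = B / (eta * tau * D)."""
    return batch_size / (lr * tau * dataset_size)


if __name__ == "__main__":
    # Hypothetical settings, not taken from the paper.
    tau, lr, dataset_size = 0.2, 1e-3, 50_000_000_000
    for batch_size in (1_000_000, 2_000_000, 4_000_000):
        lam = weight_decay_for_timescale(tau, batch_size, lr, dataset_size)
        # Round-trip check: plugging lambda back in recovers the target tau.
        assert math.isclose(adamw_timescale(batch_size, lr, lam, dataset_size), tau)
        print(f"B = {batch_size:>9,d}  ->  lambda = {lam:.3f}")
```

With the timescale, learning rate, and dataset size held fixed, doubling the batch size doubles the resulting weight decay, which is the linear relationship the abstract describes.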

Videos

Power Lines: Scaling laws for weight decay and batch size in LLM pre-training · SlidesLive

Taxonomy

Methods: AdamW · Weight Decay