Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian; Mitchell Wortsman; Jenia Jitsev; Ludwig Schmidt; Yair Carmon

arXiv:2406.19146 · cs.LG · January 22, 2025 · 1 citation

TL;DR

This paper reconciles the Kaplan et al. and Hoffmann et al. scaling laws by identifying three factors behind their discrepancy: last-layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, the two laws agree. The paper also derives scaling laws for the optimal learning rate and batch size.

Contribution

The paper explains why the Kaplan et al. and Hoffmann et al. scaling laws differ and provides corrected laws for the optimal model size, learning rate, and batch size as functions of the compute budget.

Findings

01 Once the three factors are corrected, the resulting scaling law agrees with Hoffmann et al. ("Chinchilla").

02 Careful learning rate decay is not essential for the validity of the Hoffmann et al. scaling law.

03 Tuning the AdamW β2 parameter is essential at lower batch sizes.

Abstract

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
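
To make the notion of a "scaling law for the optimal model size as a function of the compute budget" concrete, the sketch below fits a power law N_opt ≈ coeff · C^exponent by ordinary least squares in log-log space. It is only an illustration under stated assumptions: the function name `fit_power_law`, the compute budgets, and the synthetic "optimal sizes" (generated with an exponent of 0.5 and a coefficient of 0.1) are hypothetical and are not taken from the paper or its released code. The paper's own fitting procedure may differ.

```python
import math


def fit_power_law(compute, n_opt):
    """Fit n_opt ≈ coeff * compute**exponent by least squares on the logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(n) for n in n_opt]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line in log-log space is the power-law exponent.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    # The intercept, mapped back from log space, is the multiplicative coefficient.
    coeff = math.exp(mean_y - exponent * mean_x)
    return coeff, exponent


# Synthetic, noise-free data: these values are illustrative assumptions,
# not measurements from the paper.
TRUE_COEFF, TRUE_EXPONENT = 0.1, 0.5
budgets = [1e17, 1e18, 1e19, 1e20]  # hypothetical compute budgets in FLOPs
optimal_sizes = [TRUE_COEFF * c ** TRUE_EXPONENT for c in budgets]

coeff, exponent = fit_power_law(budgets, optimal_sizes)
print(f"fitted exponent: {exponent:.3f}, fitted coefficient: {coeff:.3f}")
```

In practice each "optimal size" would itself be estimated from a sweep of training runs at a fixed budget, so the fitted exponent inherits any systematic error in how compute is counted or how each run is tuned, which is the kind of effect the abstract attributes the discrepancy to.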

Code & Models

Models

formll/resolving-scaling-law-discrepancies (Hugging Face model)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: AdamW