Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Dayal Singh Kalra, Maissam Barkeshli

TL;DR
This paper introduces a framework to quantify hyperparameter transfer in large language models, highlighting the significance of embedding layer learning rate and the Maximal Update parameterization for improved transferability.
Contribution
It develops metrics to evaluate hyperparameter transfer quality and investigates why Maximal Update parameterization enhances transferability, with a focus on the embedding layer learning rate.
Findings
Maximal Update parameterization improves hyperparameter transfer over standard parameterization.
Increasing the embedding layer learning rate stabilizes training and enhances transfer.
Weight decay improves scaling law fit but can reduce robustness in extrapolation.
Abstract
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update (μP), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why μP appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of μP…
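As a rough illustration of the scaling-law route to hyperparameter transfer described in the abstract, the sketch below fits a power law to optimal learning rates measured at small widths and extrapolates it to a larger width. It is not the paper's method: the function names, the synthetic data, and the use of R² in log-log space as a stand-in for fit quality are all assumptions made for this example.

```python
import numpy as np

def fit_power_law(scales, optimal_lrs):
    """Fit lr_opt = a * scale**b by least squares in log-log space.

    Returns (a, b, r2). Using r2 as a fit-quality score is an
    assumption of this sketch, not the paper's definition.
    """
    log_s = np.log(np.asarray(scales, dtype=float))
    log_lr = np.log(np.asarray(optimal_lrs, dtype=float))
    # polyfit returns coefficients from highest degree down: slope, intercept
    b, log_a = np.polyfit(log_s, log_lr, 1)
    residuals = log_lr - (log_a + b * log_s)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((log_lr - log_lr.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return float(np.exp(log_a)), float(b), r2

def extrapolate(a, b, scale):
    """Predict the optimal learning rate at a new scale from the fit."""
    return a * scale ** b

# Synthetic (hypothetical) data: optimal lr exactly proportional to 1/width.
widths = [256, 512, 1024, 2048]
lrs = [0.02 * (256 / w) for w in widths]

a, b, r2 = fit_power_law(widths, lrs)
predicted_lr = extrapolate(a, b, 4096)
print(f"exponent b = {b:.3f}, r2 = {r2:.4f}, predicted lr at 4096 = {predicted_lr:.6f}")
```

Under a parameterization that makes the optimal learning rate approximately scale invariant, the fitted exponent `b` would be close to zero, so extrapolation would depend less on the accuracy of the fit.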
