Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Dayal Singh Kalra; Maissam Barkeshli

arXiv:2605.21486·cs.LG·May 21, 2026

TL;DR

This paper introduces a framework for quantifying hyperparameter transfer in large language models and highlights the role of the embedding layer learning rate, and of Maximal Update parameterization ($\mu$P), in improving transferability.

Contribution

It develops metrics for evaluating the quality of hyperparameter transfer and investigates why Maximal Update parameterization transfers better than standard parameterization, pointing to the embedding layer learning rate as a key factor.

Findings

01. Maximal Update parameterization improves hyperparameter transfer over standard parameterization.

02. Increasing the embedding layer learning rate stabilizes training and enhances transfer.

03. Weight decay improves the scaling law fit but can reduce robustness in extrapolation.

Abstract

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P…
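
To make "fitting a scaling law to the hyperparameters" concrete, here is a minimal sketch of one way it could be done: a power law fit, in log-log space, of the best learning rate against model width, followed by extrapolation to a larger width. Everything in it is an assumption for illustration. The widths and learning rates are synthetic, the helper `fit_power_law` is hypothetical, and the R² value is only one possible stand-in for "quality of the scaling law fit"; none of it is taken from the paper.

```python
import math

# Synthetic sweep results (illustrative only, not data from the paper):
# model widths and the best learning rate found at each width.
widths = [256.0, 512.0, 1024.0, 2048.0]
best_lr = [8e-3, 4e-3, 2e-3, 1e-3]


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the linear fit to (log x, log y).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # exponent of the power law
    log_a = mean_y - b * mean_x        # log of the prefactor
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r2


a, b, r2 = fit_power_law(widths, best_lr)

# Extrapolate the fitted law to a width outside the sweep.
lr_at_8192 = a * 8192.0 ** b

print(f"exponent b = {b:.3f}, R^2 = {r2:.4f}, predicted lr at width 8192 = {lr_at_8192:.2e}")
```

In this picture, a parameterization that renders the optimal learning rate approximately scale invariant would correspond to a fitted exponent `b` close to zero, so that little or no extrapolation is needed.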
