Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size

Soufiane Hayou; Liyuan Liu

arXiv:2506.15025·cs.LG·June 19, 2025

TL;DR

This paper analyzes how vocabulary size influences training dynamics in large language models, revealing a new regime where optimal learning rate scaling differs from previous theories, supported by theoretical insights and empirical validation.

Contribution

It introduces the Large Vocab (LV) regime, extending $\mu$P theory to account for large vocabularies, and proposes a new optimal embedding learning rate scaling rule.

Findings

01

Optimal embedding LR (relative to the hidden LR) scales as $\Theta(\sqrt{\text{width}})$ in the LV regime.

02

Theoretical analysis shows that training dynamics interpolate between the $\mu$P and LV regimes as vocabulary size increases.

03

Empirical validation confirms the new scaling rule improves training efficiency.
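
As a rough illustration of how the two scaling rules differ, the sketch below compares an embedding-to-hidden LR ratio that grows like width (the $\mu$P prediction) with one that grows like the square root of width (the LV regime). The function name `embedding_lr`, the base width, the base ratio, and the hidden LR values are all hypothetical choices made for this example; they are not taken from the paper, and the paper's rule is stated only up to constants.

```python
import math


def embedding_lr(hidden_lr: float, width: int, base_width: int,
                 base_ratio: float = 1.0, rule: str = "lv") -> float:
    """Return an embedding LR derived from the hidden LR at a given width.

    Assumption: the embedding-to-hidden LR ratio equals `base_ratio` at
    `base_width` (for example, a value tuned on a small proxy model) and
    grows with width according to `rule`:
      - "mup": ratio grows linearly with width
      - "lv":  ratio grows with the square root of width
    """
    growth = width / base_width
    if rule == "mup":
        ratio = base_ratio * growth
    elif rule == "lv":
        ratio = base_ratio * math.sqrt(growth)
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return hidden_lr * ratio


# Compare the two rules while widening a hypothetical model from width 256.
for width in (256, 1024, 4096):
    lv = embedding_lr(1e-3, width, base_width=256, rule="lv")
    mup = embedding_lr(1e-3, width, base_width=256, rule="mup")
    print(f"width={width:5d}  LV embedding LR={lv:.4g}  muP embedding LR={mup:.4g}")
```

The gap between the two rules widens with model size: at 16 times the base width, the linear rule gives an embedding LR four times larger than the square-root rule.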

Abstract

Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu$P (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when it is applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we…

Taxonomy

Topics: Educational Technology and Assessment · Natural Language Processing Techniques