Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size
Soufiane Hayou, Liyuan Liu

TL;DR
This paper analyzes how vocabulary size influences training dynamics in large language models, revealing a new regime where optimal learning rate scaling differs from previous theories, supported by theoretical insights and empirical validation.
Contribution
It introduces the Large Vocab (LV) regime, extending $\mu$P theory to account for large vocabularies, and proposes a new optimal embedding learning rate scaling rule.
Findings
Optimal embedding LR scales as $\Theta(\sqrt{\text{width}})$ in the LV regime.
Theoretical analysis shows that training dynamics interpolate between the $\mu$P and LV regimes.
Empirical validation confirms the new scaling rule improves training efficiency.
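As a rough illustration of how such a width-dependent rule might be applied, the sketch below rescales an embedding-to-hidden LR ratio that was tuned on a small proxy model. It is not the authors' code: the function name `embedding_lr`, the parameters `base_width` and `base_ratio`, and the choice to express the rule as a ratio relative to the hidden LR are assumptions made for illustration. The `"mup"` branch uses a linear-in-width ratio for comparison, which is the commonly stated $\mu$P prescription and is also an assumption here.

```python
import math


def embedding_lr(hidden_lr, width, base_width, base_ratio=1.0, regime="lv"):
    """Return a hypothetical embedding LR for a model of the given width.

    hidden_lr  -- LR used for hidden layers at the target width
    base_width -- width of the small proxy model where base_ratio was tuned
    base_ratio -- embedding LR / hidden LR found at base_width (assumed tuned)
    regime     -- "lv": ratio grows like sqrt(width); "mup": ratio grows like width
    """
    scale = width / base_width
    if regime == "lv":
        ratio = base_ratio * math.sqrt(scale)
    elif regime == "mup":
        ratio = base_ratio * scale
    else:
        raise ValueError(f"unknown regime: {regime!r}")
    return hidden_lr * ratio


if __name__ == "__main__":
    # Going from width 1024 to 4096 is a 4x increase in width.
    for regime in ("lv", "mup"):
        lr = embedding_lr(1e-3, width=4096, base_width=1024, regime=regime)
        print(regime, lr)
```

With a 4x increase in width, the square-root rule doubles the ratio while the linear rule quadruples it, so the two prescriptions diverge quickly as models get wider.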
Abstract
Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu$P (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we…
Taxonomy
Topics: Educational Technology and Assessment · Natural Language Processing Techniques
