Weight Decay may matter more than muP for Learning Rate Transfer in Practice
Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen

TL;DR
This paper investigates which factors govern learning rate transfer across neural network widths. It finds that, outside a brief phase at the start of training, weight decay rather than muP scaling is what stabilizes the update dynamics of internal representations in practical setups such as LLM training.
Contribution
The study challenges the prevailing emphasis on muP scaling by demonstrating the dominant role of weight decay in learning rate transfer during neural network training.
Findings
Weight decay stabilizes internal representations across widths.
muP acts mainly as an implicit warmup rather than a scaling rule.
Modified warmup schedules can replace muP scaling effectively.
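The warmup idea in the findings above can be pictured with a small schedule function. This is only a minimal sketch, not the schedule used in the paper: the function name `warmup_lr`, the linear ramp, and the `start_fraction` parameter are all assumptions made for illustration. It shows a learning rate that starts low and ramps up to a shared peak value, which is the kind of early-training reduction the findings attribute to muP.

```python
def warmup_lr(step, peak_lr, warmup_steps, start_fraction=0.0):
    """Learning rate at `step`: linear warmup, then a constant peak rate.

    `start_fraction` is a hypothetical knob (not from the paper): the
    fraction of `peak_lr` that is used at step 0.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    progress = step / warmup_steps  # 0.0 at step 0, approaches 1.0
    return peak_lr * (start_fraction + (1.0 - start_fraction) * progress)


if __name__ == "__main__":
    # Print the schedule at a few steps for an illustrative configuration.
    for step in (0, 250, 500, 1000, 2000):
        print(step, warmup_lr(step, peak_lr=1e-3, warmup_steps=1000,
                              start_fraction=0.25))
```

In this picture, a wider model could use a smaller `start_fraction` or a longer `warmup_steps` instead of a permanently reduced learning rate. Whether that matches the specific modification studied in the paper is not stated in this summary.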
Abstract
Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across…
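The learning rate scaling that the abstract attributes to muP can be sketched as follows. This is a hedged illustration based on the commonly cited form of the rule for Adam-style optimizers, where the hidden-layer learning rate shrinks in inverse proportion to width relative to a tuned base model. The function name `mup_hidden_lr` and the example widths are assumptions, and the full parameterization also covers initialization and other layer types that are omitted here.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Width-scaled learning rate for hidden-layer weights.

    Assumes the commonly cited muP rule for Adam-style optimizers:
    the hidden-layer learning rate is proportional to 1 / width.
    `base_lr` is the rate tuned on a small model of width `base_width`.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


if __name__ == "__main__":
    # Transfer a rate tuned at width 256 to progressively wider models.
    for width in (256, 512, 1024, 4096):
        print(width, mup_hidden_lr(base_lr=1e-3, base_width=256, width=width))
```

Under this rule the wider model's hidden layers train with a permanently smaller learning rate than the base model's, and the factor depends only on the width ratio.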
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing
