A Proof of Learning Rate Transfer under $\mu$P
Soufiane Hayou

TL;DR
This paper proves that in linear MLPs parametrized with $\mu$P, the optimal learning rate converges to a non-zero constant as width increases, which explains learning rate transfer. The same property fails under other parametrizations.
Contribution
It provides the first proof of learning rate transfer under $\mu$P in wide neural networks, supported by empirical validation.
Findings
Optimal learning rate converges to a non-zero constant under $\mu$P.
Learning rate transfer fails under Standard and Neural Tangent Parametrizations.
Theoretical results are supported by extensive empirical experiments.
Abstract
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
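To make the setting in the abstract concrete, here is a minimal NumPy sketch of a linear MLP with a $\mu$P-style initialization and one gradient step. It is an illustration under stated assumptions, not the paper's own code. It assumes one common SGD formulation of $\mu$P (initialization variance 1/fan_in for input and hidden layers, 1/fan_in^2 for the output layer, learning-rate multipliers n, 1, 1/n), which may differ from the paper's exact setup. The function names `mup_init`, `mup_lr_multipliers`, `mse`, and `loss_after_one_step` are hypothetical.

```python
import numpy as np

def mup_init(d, n, depth, rng):
    """Sample [W0, W1, ..., W_depth, v] for f(x) = v^T W_depth ... W1 W0 x."""
    # Assumed muP-style scaling: variance 1/fan_in for input and hidden layers.
    W0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    hidden = [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)) for _ in range(depth)]
    # Assumed muP-style scaling: variance 1/fan_in^2 for the output layer.
    v = rng.normal(0.0, 1.0 / n, size=n)
    return [W0, *hidden, v]

def mup_lr_multipliers(n, depth):
    """Per-layer SGD learning-rate multipliers (assumed): n, 1, ..., 1, 1/n."""
    return [float(n)] + [1.0] * depth + [1.0 / n]

def mse(params, X, y):
    """Half mean squared error of the linear MLP on (X, y)."""
    h = X
    for W in params[:-1]:
        h = h @ W.T
    return 0.5 * np.mean((h @ params[-1] - y) ** 2)

def loss_after_one_step(params, X, y, eta, lr_mults):
    """Take one full-batch gradient step with base rate eta; return the new loss."""
    # Forward pass, keeping the input to every layer.
    acts = [X]
    for W in params[:-1]:
        acts.append(acts[-1] @ W.T)
    resid = (acts[-1] @ params[-1] - y) / len(y)  # d(loss)/d(output)
    # Manual backward pass through the linear layers.
    grads = [None] * len(params)
    grads[-1] = acts[-1].T @ resid
    delta = np.outer(resid, params[-1])
    for i in range(len(params) - 2, -1, -1):
        grads[i] = delta.T @ acts[i]
        delta = delta @ params[i]
    stepped = [p - eta * c * g for p, c, g in zip(params, lr_mults, grads)]
    return mse(stepped, X, y)
```

One way to probe learning rate transfer with this sketch is to evaluate `loss_after_one_step` over a grid of `eta` values for several widths `n` and compare where the minimum falls. Removing the multipliers, or changing the output-layer variance to 1/fan_in, gives a rough stand-in for other parametrizations.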
Taxonomy
Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
