A Proof of Learning Rate Transfer under $\mu$P

Soufiane Hayou

arXiv:2511.01734 · stat.ML · February 26, 2026

TL;DR

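This paper proves that in a linear MLP parametrized with $\mu$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, which explains learning rate transfer. The same property fails under Standard Parametrization (SP) and Neural Tangent Parametrization (NTP).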

Contribution

It provides the first proof of learning rate transfer with width under $\mu$P for a linear multi-layer perceptron (MLP), supported by empirical results.

Findings

1. Under $\mu$P, the optimal learning rate converges to a non-zero constant as width goes to infinity.

2. Learning rate transfer fails under Standard Parametrization (SP) and Neural Tangent Parametrization (NTP).

3. The theoretical results are supported by extensive empirical results.

Abstract

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
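The claim can be probed numerically on a toy problem. The sketch below is not the paper's setup: it assumes a two-layer linear network in a mean-field ($\mu$P-style) scaling, a single training sample, one gradient step, and a fixed grid of candidate learning rates. The helper names `one_step_loss` and `best_learning_rate`, the toy input, and the toy target are hypothetical choices made for illustration only.

```python
import math
import random


def one_step_loss(width, eta, seed=0, d=4):
    """Loss after one gradient step on one sample for the toy network
    f(x) = (1/width) * sum_i v[i] * (W[i] . x).  Illustrative only."""
    rng = random.Random(seed)
    x = [1.0] * d   # toy input (assumption)
    y = 1.0         # toy target (assumption)

    # Assumed muP-style (mean-field) init: W entries ~ N(0, 1/d), v entries ~ N(0, 1).
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
         for _ in range(width)]
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]

    h = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]  # hidden features W x
    f = sum(vi * hi for vi, hi in zip(v, h)) / width
    err = f - y  # dL/df for L = 0.5 * (f - y)^2

    # Gradients carry a 1/width factor; the assumed step size eta * width
    # cancels it, so each weight moves by an amount independent of width.
    lr = eta * width
    new_v = [vi - lr * err * hi / width for vi, hi in zip(v, h)]
    new_W = [[wij - lr * err * vi * xj / width for wij, xj in zip(row, x)]
             for row, vi in zip(W, v)]

    new_h = [sum(wij * xj for wij, xj in zip(row, x)) for row in new_W]
    new_f = sum(vi * hi for vi, hi in zip(new_v, new_h)) / width
    return 0.5 * (new_f - y) ** 2


def best_learning_rate(width, etas, seed=0):
    """Return the eta from the grid with the smallest one-step loss."""
    losses = [one_step_loss(width, eta, seed) for eta in etas]
    return etas[losses.index(min(losses))]


if __name__ == "__main__":
    etas = [2.0 ** k for k in range(-8, 3)]
    for width in (64, 256, 1024):
        print(width, best_learning_rate(width, etas))
```

If the toy scaling behaves like the setting analyzed in the paper, the selected grid value should stop moving as the width grows. A real comparison would also sweep SP and NTP, average over seeds, and train for more than one step.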

Taxonomy

Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM