Spectral Condition for $\mu$P under Width-Depth Scaling

Chenyu Zheng; Rongzhen Wang; Xinyu Zhang; Chongxuan Li

arXiv:2603.00541·cs.LG·May 12, 2026

TL;DR

This paper introduces a unified spectral framework for maximal update parameterization ($\mu$P) under joint width and depth scaling, improving stability and hyperparameter transfer for deep residual networks like Transformers.

Contribution

It develops a simple, unified spectral approach for $\mu$P in joint width-depth scaling, unifies previous formulations, and extends $\mu$P to various optimizers with practical benefits.

Findings

01

The $k \geq 2$ case, in which each residual block contains two or more transformations, is better suited to practical architectures than the $k = 1$ case.

02

The spectral framework enables stable feature learning and hyperparameter transfer.

03

Experiments on GPT-2-style models confirm the effectiveness of the $k \geq 2$ $\mu$P formulation.

Abstract

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k = 1$ to $k \geq 2$, unifying previously disparate $\mu$P formulations and…
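To make "norms of weights and their per-step updates" concrete, the sketch below estimates the spectral norm of a matrix and compares it with a target. The target used here is an assumption: it is the width-only spectral condition from earlier work (spectral norm on the order of $\sqrt{n_\text{out}/n_\text{in}}$ for both a weight matrix and its update), not the depth-dependent scaling derived in this paper, which the abstract does not state. The helper names `spectral_norm`, `width_only_target`, and `norm_ratio` are hypothetical and do not come from the paper or its repository.

```python
import math
import random


def spectral_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W (a list of rows)
    by power iteration on W^T W."""
    n_out, n_in = len(W), len(W[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    for _ in range(iters):
        # u = W v, then v = W^T u, then renormalize v.
        u = [sum(W[i][j] * v[j] for j in range(n_in)) for i in range(n_out)]
        v = [sum(W[i][j] * u[i] for i in range(n_out)) for j in range(n_in)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0  # v lies in the null space (e.g. W is all zeros)
        v = [x / norm for x in v]
    # With v a unit vector, ||W v|| approximates the top singular value.
    u = [sum(W[i][j] * v[j] for j in range(n_in)) for i in range(n_out)]
    return math.sqrt(sum(x * x for x in u))


def width_only_target(n_out, n_in):
    """ASSUMPTION: width-only target sqrt(n_out / n_in) from prior work.
    Depth-dependent factors from this paper are deliberately omitted."""
    return math.sqrt(n_out / n_in)


def norm_ratio(W):
    """Ratio of the measured spectral norm to the width-only target.
    Values that stay O(1) as width grows indicate the target is met."""
    return spectral_norm(W) / width_only_target(len(W), len(W[0]))


if __name__ == "__main__":
    rng = random.Random(1)
    n_out, n_in = 32, 16
    # Gaussian init with standard deviation 1 / sqrt(n_in); illustrative only.
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    # Stand-in for one optimizer step: a small random perturbation.
    dW = [[rng.gauss(0.0, 1e-2) for _ in range(n_in)] for _ in range(n_out)]
    print("weight ratio:", norm_ratio(W))
    print("update ratio:", norm_ratio(dW))
```

In a width-depth sweep, one would track these two ratios per layer across model sizes and check them against whatever scaling rule is being tested; the pure-Python power iteration is kept for portability and would normally be replaced by a linear-algebra library call.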

Code & Models

Repositories

ml-gsai/Width-Depth-muP (GitHub)
