TL;DR
This paper introduces a unified spectral framework for maximal update parameterization (μP) under joint width and depth scaling, improving stability and hyperparameter transfer for deep residual networks such as Transformers.
Contribution
It develops a simple, unified spectral approach for μP under joint width-depth scaling, unifies previous formulations, and extends μP to various optimizers with practical benefits.
Findings
The $k \geq 2$ case of residual blocks is more suitable for practical architectures.
The spectral framework enables stable feature learning and hyperparameter transfer.
Experiments on GPT-2 style models confirm the effectiveness of the $k \geq 2$ μP formulation.
Abstract
Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization (μP) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for μP under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k = 1$ to $k \geq 2$, unifying previously disparate μP formulations and…
