On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Ruihan Xu, Jiajin Li, Yiping Lu

TL;DR
This paper introduces a new class of width-independent optimizers for deep neural networks based on mean-normalized matrix operator norms, enabling stable training and effective learning-rate transfer across different model widths.
Contribution
It proposes mean-normalized operator norms that ensure width-independent control of optimizer behavior and introduces MOGA, a practical width-aware optimizer for large-scale pre-training.
Findings
MOGA achieves stable training across various model widths.
Row normalization in MOGA improves training speed and stability.
Width-independent smoothness guarantees enhance optimizer robustness.
Abstract
A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width increases. We address this question by interpreting several widely used neural-network optimizers, including AdamW and Muon, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms that admit layerwise composability, yield width-independent smoothness bounds, and give…
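The steepest-descent view in the abstract can be illustrated with a minimal NumPy sketch. It uses only standard operator norms, since the mean-normalized norms and the MOGA update are not defined in this excerpt, so it should not be read as the authors' method. The function names, the `eps` guard, and the toy gradient shape are assumptions made for illustration, and any width-dependent scaling factors are omitted.

```python
import numpy as np


def sign_direction(grad):
    # l1 -> l_inf operator norm = largest absolute entry. The steepest-descent
    # direction is the entrywise sign (an Adam-style update without moment averaging).
    return np.sign(grad)


def spectral_direction(grad):
    # l2 -> l2 operator norm = largest singular value. The direction is U V^T
    # from the reduced SVD (an idealized Muon-style orthogonalized update).
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def row_normalized_direction(grad, eps=1e-12):
    # l2 -> l_inf operator norm = largest row l2 norm. Each row is rescaled to
    # unit l2 norm. eps is a hypothetical guard against all-zero rows.
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / np.maximum(row_norms, eps)


def column_normalized_direction(grad, eps=1e-12):
    # l1 -> l2 operator norm = largest column l2 norm. Each column is rescaled
    # to unit l2 norm, with the same hypothetical eps guard.
    col_norms = np.linalg.norm(grad, axis=0, keepdims=True)
    return grad / np.maximum(col_norms, eps)


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 6))  # toy gradient for a 4 x 6 weight matrix

directions = {
    "sign": sign_direction,
    "spectral": spectral_direction,
    "row": row_normalized_direction,
    "column": column_normalized_direction,
}
for name, fn in directions.items():
    direction = fn(grad)
    # For a unit-norm steepest-descent direction, <grad, direction> equals
    # the dual norm of grad under the chosen operator norm.
    print(name, float(np.sum(grad * direction)))
```

Each function returns the unit-norm matrix that maximizes the inner product with the gradient under its norm, which is why a different choice of norm changes the update rule rather than only its step size.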
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Machine Learning and Data Classification
