On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

Ruihan Xu; Jiajin Li; Yiping Lu

arXiv:2603.09952·cs.LG·March 11, 2026

TL;DR

This paper introduces a new class of width-independent optimizers for deep neural networks based on mean-normalized matrix operator norms, enabling stable training and effective learning-rate transfer across different model widths.

Contribution

It proposes mean-normalized operator norms that ensure width-independent control of optimizer behavior and introduces MOGA, a practical width-aware optimizer for large-scale pre-training.

Findings

01 MOGA achieves stable training across various model widths.

02 Row normalization in MOGA improves training speed and stability.

03 Width-independent smoothness guarantees enhance optimizer robustness.

Abstract

A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including AdamW and Muon, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $\pmean \to \qmean$, that admit layerwise composability, yield width-independent smoothness bounds, and give…
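To make the steepest-descent view in the abstract concrete, here is a minimal sketch of update directions under two standard (not mean-normalized) operator norms. It is an illustration under stated assumptions, not the paper's MOGA algorithm: the function names are hypothetical, the weight matrix is assumed to act as $Wx$ so that rows index outputs, and the sign rule is only the common simplification of Adam-style updates with moment averaging switched off. Muon's spectral-norm case is omitted because it requires an SVD or a Newton-Schulz iteration.

```python
import math

# Hypothetical helper names; this is NOT the paper's MOGA implementation.

def sign_direction(grad):
    """Steepest-descent direction under the max-entry norm
    (the l1 -> l_inf operator norm): the elementwise sign of the gradient."""
    return [[(g > 0) - (g < 0) for g in row] for row in grad]

def row_normalized_direction(grad, eps=1e-12):
    """Steepest-descent direction under the l2 -> l_inf operator norm
    (the largest row l2 norm): each row is scaled to unit l2 length."""
    out = []
    for row in grad:
        norm = math.sqrt(sum(g * g for g in row))
        out.append([g / (norm + eps) for g in row])
    return out

def max_abs_entry(mat):
    """l1 -> l_inf operator norm of a matrix."""
    return max(abs(v) for row in mat for v in row)

def max_row_l2_norm(mat):
    """l2 -> l_inf operator norm of a matrix."""
    return max(math.sqrt(sum(v * v for v in row)) for row in mat)

def step(weights, direction, lr):
    """One descent step: W <- W - lr * direction."""
    return [[w - lr * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, direction)]

# Toy 2x2 gradient, chosen only for illustration.
grad = [[3.0, 4.0], [0.0, -2.0]]
weights = [[1.0, 1.0], [1.0, 1.0]]

sign_dir = sign_direction(grad)
row_dir = row_normalized_direction(grad)
new_weights = step(weights, row_dir, lr=0.1)
```

Both directions have unit size in their respective norms, so the learning rate alone sets the step length measured in that norm. Width-independence in the paper's sense additionally relies on the mean-normalized variants, whose definition is not given in the abstract and is therefore not reproduced here.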

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Machine Learning and Data Classification