TL;DR
RMNP introduces a computationally efficient preconditioning method for deep learning optimization by replacing Newton-Schulz iteration with row-wise normalization, maintaining performance while reducing complexity.
Contribution
The paper proposes RMNP, a novel optimizer that simplifies preconditioning in neural network training, achieving similar results to Muon with lower computational cost.
Findings
RMNP reduces per-iteration complexity from O(mn·min(m,n)) to O(mn).
RMNP maintains comparable optimization performance to Muon.
Experiments on large language models show RMNP's efficiency and effectiveness.
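The O(mn) cost in the findings above can be illustrated with a minimal sketch of row-wise normalization. The excerpt does not name the norm, so this assumes an l2 norm per row. The function name `row_normalize`, the `eps` stabilizer, and the plain-list matrix representation are illustrative choices, not the paper's implementation.

```python
import math

def row_normalize(momentum, eps=1e-8):
    """Scale each row of an m x n matrix to (approximately) unit l2 norm.

    Assumption: l2 norm per row, with a small eps to avoid division by zero.
    Each entry is read once for the norm and once for the scaling,
    so the total work is O(mn).
    """
    normalized = []
    for row in momentum:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / (norm + eps) for x in row])
    return normalized

# Hypothetical 2 x 3 momentum matrix, for illustration only.
M = [[3.0, 4.0, 0.0],
     [0.0, 0.0, 2.0]]
U = row_normalize(M)
```

No matrix product appears anywhere in this sketch, which is where the saving over a Newton-Schulz style update would come from.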
Abstract
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verified…
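For contrast with row-wise normalization, a Newton-Schulz iteration can be sketched as follows. This uses the classic cubic form X ← 1.5·X − 0.5·X·XᵀX, swapped in for illustration. Muon's own iteration uses a different, tuned polynomial, so this is not Muon's implementation. The helper names `matmul`, `transpose`, and `newton_schulz`, and the defaults for `steps` and `eps`, are assumptions.

```python
import math

def matmul(A, B):
    """Naive dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=20, eps=1e-8):
    """Approximate the orthogonal polar factor of G (classic cubic iteration).

    Assumption: the textbook update X <- 1.5*X - 0.5*X*(X^T X),
    not the tuned polynomial used by Muon.
    """
    # Divide by the Frobenius norm so every singular value is at most 1,
    # which keeps the cubic iteration inside its convergence region.
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (fro + eps) for x in row] for row in G]
    for _ in range(steps):
        gram = matmul(transpose(X), X)   # n x n Gram matrix, O(m n^2)
        cubic = matmul(X, gram)          # m x n product, O(m n^2)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, cubic)]
    return X

# Hypothetical 2 x 2 input, for illustration only.
G = [[3.0, 0.0],
     [0.0, 4.0]]
Q = newton_schulz(G)
```

Every step here needs two matrix products. Because the sketch always forms XᵀX, its per-step cost is O(mn²). An implementation would form the smaller of XᵀX and XXᵀ to reach the O(mn·min(m,n)) cost quoted for Muon.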
