
TL;DR
This paper introduces Newton-Muon, a new optimizer derived from a quadratic surrogate model of the loss, which reduces the number of iteration steps and the wall-clock time needed to train large language models.
Contribution
It derives a closed-form update rule for a new optimizer, Newton-Muon, providing a novel interpretation of Muon as an implicit Newton-type method.
Findings
Newton-Muon reaches the target validation loss in 6% fewer iterations.
Reduces wall-clock training time by approximately 4%.
Offers a new theoretical perspective on Muon optimizer design.
Abstract
The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix using only three matrices: the gradient, an output-space curvature matrix, and the data matrix that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay), where…
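To make the "matrix-gradient orthogonalization" mentioned in the abstract concrete, here is a minimal sketch of what orthogonalizing a gradient matrix means: replacing G = U S Vᵀ by its orthogonal polar factor U Vᵀ. It is an illustration under stated assumptions, not the paper's code. The helper names (`matmul`, `transpose`, `orthogonalize`) are hypothetical, and the sketch uses the classical cubic Newton-Schulz iteration rather than whatever iteration a given Muon implementation uses. It does not reproduce the Newton-Muon update rule, which is truncated in the abstract above.

```python
# Illustrative sketch only (not the paper's implementation).
# Orthogonalizes a matrix with the classical cubic Newton-Schulz iteration
# X <- 1.5 * X - 0.5 * X X^T X, in pure Python. Helper names are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g, where g = U S V^T.

    Assumes g has full rank and is reasonably well conditioned;
    zero singular values stay at zero under this iteration.
    """
    fro = sum(x * x for row in g for x in row) ** 0.5
    if fro == 0.0:
        return [row[:] for row in g]
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x

if __name__ == "__main__":
    # A symmetric positive definite input has the identity as its polar factor,
    # so the result should be approximately the 2x2 identity matrix.
    print(orthogonalize([[3.0, 1.0], [1.0, 2.0]]))
```

The iteration maps each singular value s to 1.5s − 0.5s³, which drives values in (0, 1] toward 1 while leaving the singular vectors unchanged. That is why the result keeps the "direction" of the gradient but discards its scale information.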
