TL;DR
MuonEq adds a lightweight equilibration step (row and/or column normalization of the momentum matrix) before Muon's orthogonalization, improving convergence and validation perplexity in LLaMA2 pretraining.
Contribution
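The paper proposes a computationally light equilibration step that rebalances the momentum matrix before finite-step Newton-Schulz orthogonalization, improving the geometry the orthogonalization sees. It shows that the R variant retains the standard nonconvex stationarity guarantees and reports empirical gains over Muon.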
Findings
MuonEq (R) outperforms Muon in LLaMA2 pretraining across multiple model sizes.
MuonEq (R) converges faster and reaches lower validation perplexity than Muon.
Theoretical analysis shows that MuonEq (R) retains the standard nonconvex stationarity guarantees.
Abstract
Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions typically either rescale updates after orthogonalization or use heavier whitening-based preconditioners before it. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon with three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). By rebalancing the momentum matrix before finite-step Newton-Schulz orthogonalization, MuonEq improves the geometry seen by orthogonalization. We show that finite-step orthogonalization is governed by the input spectrum, especially stable rank and condition number, and that row/column normalization acts as a zeroth-order surrogate for whitening. For hidden matrix weights, R is the default variant. Theoretically, MuonEq (R) retains the standard…
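To make the idea in the abstract concrete, the following is a minimal NumPy sketch of equilibrating a momentum matrix before a finite number of Newton-Schulz steps. It is not the authors' implementation. Several details are assumptions made for illustration: the function names (`equilibrate`, `newton_schulz`, `stable_rank`, `muoneq_direction`) are hypothetical, normalization uses L2 norms, the RC form is approximated by one row pass followed by one column pass, and the iteration is the classical cubic Newton-Schulz map rather than whatever polynomial the paper uses.

```python
import numpy as np


def equilibrate(m, mode="R", eps=1e-8):
    """Rescale rows and/or columns of m to roughly unit L2 norm.

    mode "R": rows, "C": columns, "RC": rows then columns.
    The single row-then-column pass for "RC" is an assumption of this sketch.
    """
    m = np.asarray(m, dtype=np.float64)
    if mode in ("R", "RC"):
        m = m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)
    if mode in ("C", "RC"):
        m = m / (np.linalg.norm(m, axis=0, keepdims=True) + eps)
    return m


def newton_schulz(m, steps=5, eps=1e-8):
    """Finite-step classical (cubic) Newton-Schulz iteration.

    Dividing by the Frobenius norm puts all singular values in (0, 1],
    where x <- 1.5 x - 0.5 x x^T x pushes each one toward 1.
    """
    x = m / (np.linalg.norm(m) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x


def stable_rank(m):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(m, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)


def muoneq_direction(momentum, mode="R", steps=5):
    """Equilibrate first, then orthogonalize with a few Newton-Schulz steps."""
    return newton_schulz(equilibrate(momentum, mode), steps)
```

As a small sanity check on the stable-rank argument: for the badly scaled matrix `np.diag([100.0, 1.0, 1.0, 1.0])`, `stable_rank` is about 1, while after `equilibrate(..., "R")` the matrix is essentially the identity and the stable rank is about 4. A flatter input spectrum like this is what lets a small, fixed number of Newton-Schulz steps get close to an orthogonal factor.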
