TL;DR
LionMuon is an optimizer that alternates between Muon's spectral (matrix-sign) update and Lion's sign update on a fixed period P, keeping most of Muon's step quality while bringing the average per-step cost close to that of sign-based methods.
Contribution
It introduces LionMuon, which combines spectral and sign-based updates through a single shared momentum buffer. Optimizer state memory matches Lion's and is half of AdamW's, and the average iteration cost is considerably lower than Muon's.
Findings
At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every tested dataset and architecture.
A simpler single-EMA variant, SignMuon, already outperforms pure Muon on its own.
Theoretical analysis provides sharp complexity bounds predicting optimal periods and conditions for superiority.
Abstract
In large-scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign-based optimizers like Lion or Signum produce cheap per-step updates, whereas Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign-based methods. It alternates between Lion's and Muon's updates on a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW's. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M…
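The alternation described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `matrix_sign` and `lionmuon_step`, the hyperparameter defaults, the choice of which step in each period is the spectral one, and the absence of any rescaling between the two update types are all assumptions. The "dual-EMA" buffer is interpreted here in the Lion style (one stored buffer, with `beta1` used for the update direction and `beta2` for the stored momentum). The spectral step uses an exact SVD for clarity, in place of the Newton-Schulz iteration that Muon implementations typically use.

```python
import numpy as np

def matrix_sign(M):
    """Polar factor U @ Vt of M (all singular values set to 1), via exact SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def lionmuon_step(W, grad, m, step, lr=1e-3, beta1=0.9, beta2=0.99, period=2):
    """One hypothetical LionMuon-style update on a 2-D weight matrix W.

    m is the single shared momentum buffer (same shape as W).
    Every `period`-th step is spectral; the others are elementwise sign steps.
    """
    # Direction candidate built from the shared buffer (Lion-style interpolation).
    c = beta1 * m + (1.0 - beta1) * grad
    if step % period == 0:
        direction = matrix_sign(c)   # Muon-style spectral step (assumed placement)
    else:
        direction = np.sign(c)       # Lion-style sign step
    W_new = W - lr * direction
    # Both step types update the same buffer, so the state is one array per weight.
    m_new = beta2 * m + (1.0 - beta2) * grad
    return W_new, m_new

# Usage on a toy least-squares problem: minimize ||W - T||_F^2.
rng = np.random.default_rng(0)
T = rng.standard_normal((8, 4))
W = np.zeros((8, 4))
m = np.zeros_like(W)
for t in range(100):
    grad = 2.0 * (W - T)
    W, m = lionmuon_step(W, grad, m, step=t, lr=1e-2)
```

Because both branches read and write the same buffer `m`, the per-parameter state is one array, which is consistent with the abstract's statement that memory matches Lion and is half of AdamW's two-buffer state. With `period=2`, only every other step pays for the matrix-sign computation.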
