AMUSE: Anytime Muon with Stable Gradient Evaluation
Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, Chulhee Yun

TL;DR
AMUSE is an optimizer that combines Muon's rapid progress with stable gradient evaluation, removing the need for learning rate schedules and enabling anytime training across vision and language tasks.
Contribution
The paper proposes AMUSE, an optimizer that pairs Muon's fast progress along the bulk (river) subspace with stable averaging, improving on AdamW and Muon without a learning rate schedule.
Findings
AMUSE outperforms AdamW and Muon on vision and language tasks.
It achieves better performance-iteration trade-offs.
It supports anytime training without learning rate schedules.
Abstract
Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient…
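The abstract names two ingredients: orthogonalizing the momentum of a matrix parameter (Muon) and averaging iterates in place of a schedule (Schedule-Free optimization). The sketch below illustrates each in isolation with plain Python lists. It is not the AMUSE algorithm, whose details are not given in the text above. The function names are hypothetical, and the classic cubic Newton-Schulz iteration is used here for clarity, whereas Muon implementations typically use a tuned polynomial variant with only a few steps.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor U V^T of m.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    (an assumption for illustration, not the paper's exact routine).
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # where the iteration pushes each of them toward 1.
    norm = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def running_average(iterates):
    """Uniform running average: avg_t = avg_{t-1} + (x_t - avg_{t-1}) / t."""
    avg = None
    for t, x in enumerate(iterates, start=1):
        avg = x if avg is None else avg + (x - avg) / t
    return avg


# A toy 2x2 "momentum" matrix (made-up numbers).
momentum = [[2.0, 1.0], [0.0, 0.5]]
update = orthogonalize(momentum)
# The rows of `update` are approximately orthonormal, so update @ update^T is
# close to the identity: every singular direction gets the same step size.
gram = matmul(update, transpose(update))

# Iterates that bounce between the two valley walls average out to the middle.
averaged = running_average([1.0, -1.0, 1.0, -1.0])
```

The first half shows why orthogonalization equalizes progress across directions: small singular values of the momentum are raised to the same scale as large ones, which in the abstract's terms boosts the bulk component but also any noise. The second half shows how averaging damps an oscillating sequence without shrinking the step size.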
