Muon Does Not Converge on Convex Lipschitz Functions
Tetiana Parshakova, Ahmed Khaled, Michael Crawshaw, Guillaume Garrigos, Robert M. Gower

TL;DR
This paper shows that Muon, a deep learning optimizer, does not converge on convex Lipschitz functions, which suggests that its empirical success depends on properties such as smoothness rather than on convex Lipschitz structure.
Contribution
The paper shows Muon fails to converge on convex Lipschitz functions and proposes error feedback as a fix, highlighting the limitations of convex Lipschitz theory for Muon.
Findings
Muon does not converge on the class of convex Lipschitz functions, regardless of the choice of learning rate schedule.
Error feedback restores convergence of Muon and of all non-Euclidean subgradient methods with momentum.
Error feedback degrades Muon's performance in image classification and language modeling tasks.
Abstract
Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification…
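For readers unfamiliar with the optimizer, the following is a minimal sketch of a Muon-style update as it is commonly described: accumulate momentum, then step along the orthogonalized momentum matrix. It is not taken from this paper and does not reproduce its counterexample or its error feedback scheme. The exact SVD here stands in for the Newton-Schulz iteration that practical implementations typically use, and the names `orthogonalize`, `muon_step`, `lr`, and `beta` are illustrative assumptions.

```python
import numpy as np

def orthogonalize(m):
    # Polar factor U V^T of m, computed with an exact SVD.
    # Assumption: used here in place of an approximate Newton-Schulz iteration.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(w, grad, momentum, lr=0.1, beta=0.9):
    # Hypothetical helper: one Muon-style step on a weight matrix w.
    # Accumulate (sub)gradients into the momentum buffer.
    momentum = beta * momentum + grad
    # Move along the orthogonalized momentum, so the update has
    # spectral norm lr regardless of the gradient's magnitude.
    w = w - lr * orthogonalize(momentum)
    return w, momentum

# Usage on a random matrix standing in for a (sub)gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal((3, 2))
w, m = muon_step(np.zeros((3, 2)), g, np.zeros((3, 2)))
```

Because the orthogonalized direction discards the singular values of the momentum, the step length is set by the learning rate alone. This is the sense in which Muon is a non-Euclidean (spectral norm) method.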
