MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Feihu Huang, Yuning Luo, Songcan Chen

TL;DR
This paper introduces MiMuon, a hybrid optimizer combining Muon and momentum-based SGD, with improved generalization error bounds and comparable convergence rates, validated on large models like Qwen3-0.6B and YOLO26m.
Contribution
The paper establishes the generalization error of the Muon optimizer and proposes MiMuon, which achieves lower generalization error while maintaining similar convergence properties.
Findings
MiMuon has a generalization error of O(1/N)
Muon's generalization error is O(1/(Nκ^T))
Numerical results show MiMuon is efficient on large models such as Qwen3-0.6B and YOLO26m
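To make the gap between the two rates concrete, the sketch below evaluates both order-level expressions numerically. It is only an illustration: the constant hidden by the O(·) notation is assumed to be 1, the function names `muon_bound` and `mimuon_bound` are hypothetical, and the values of N, T and κ are made up for the example. In particular, κ < 1 is an assumption used here to show the regime where the κ^T factor hurts; the summary does not state the range of κ.

```python
def muon_bound(n_samples, kappa, n_iters):
    """Order-level Muon expression 1 / (N * kappa**T); hidden constant assumed to be 1."""
    return 1.0 / (n_samples * kappa ** n_iters)


def mimuon_bound(n_samples):
    """Order-level MiMuon expression 1 / N; hidden constant assumed to be 1."""
    return 1.0 / n_samples


# Hypothetical values chosen only for illustration.
N = 10_000
for T in (1, 10, 100):
    for kappa in (0.9, 1.0):
        # The ratio simplifies to 1 / kappa**T, so it does not depend on N.
        ratio = muon_bound(N, kappa, T) / mimuon_bound(N)
        print(f"T={T:>3} kappa={kappa}: Muon/MiMuon ratio = {ratio:.3g}")
```

Under these assumptions the ratio of the two expressions is 1/κ^T, so for κ = 1 they coincide, while for any κ < 1 the Muon expression grows geometrically with the number of iterations T and the MiMuon expression stays at 1/N.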
Abstract
Matrix-structured parameters appear frequently in artificial intelligence models such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. Thus, in this paper, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of O(1/(Nκ^T)), where N is the training sample size, T denotes the iteration number, and κ denotes the minimum difference between singular values of the gradient estimate. To enhance generalization of the…
