MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Feihu Huang; Yuning Luo; Songcan Chen

arXiv:2605.19619·cs.LG·May 20, 2026

TL;DR

This paper introduces MiMuon, a hybrid optimizer that combines Muon with momentum-based SGD. It achieves a lower generalization error bound than Muon with a comparable convergence rate, and is evaluated on large models such as Qwen3-0.6B and YOLO26m.

Contribution

The paper establishes a generalization error bound for the Muon optimizer and proposes MiMuon, which achieves lower generalization error than Muon while maintaining similar convergence properties.

Findings

01 MiMuon has a generalization error of O(1/N).

02 Muon's generalization error is O(1/(Nκ^T)).

03 Numerical results show MiMuon is efficient on large models.
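To make the difference between the two rates concrete, the following minimal sketch evaluates 1/N and 1/(Nκ^T) for a few values of κ and T, where N is the training sample size, T the iteration number, and κ the minimum difference between singular values of the gradient estimate. The function names and the constant `c` are hypothetical: big-O notation hides constants that this page does not give, so the numbers only illustrate how each expression scales, not actual error values.

```python
def mimuon_rate(n_samples: int, c: float = 1.0) -> float:
    """Evaluate c / N, the O(1/N) rate reported for MiMuon.

    The constant c is a placeholder; it is not taken from the paper.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return c / n_samples


def muon_rate(n_samples: int, kappa: float, n_iters: int, c: float = 1.0) -> float:
    """Evaluate c / (N * kappa**T), the O(1/(N kappa^T)) rate reported for Muon.

    The constant c is a placeholder; it is not taken from the paper.
    """
    if n_samples <= 0 or kappa <= 0:
        raise ValueError("n_samples and kappa must be positive")
    return c / (n_samples * kappa ** n_iters)


if __name__ == "__main__":
    n = 10_000
    print(f"1/N = {mimuon_rate(n):.3e}")
    for kappa in (0.9, 1.0, 1.1):
        for t in (10, 100):
            print(f"kappa={kappa}, T={t}: 1/(N kappa^T) = {muon_rate(n, kappa, t):.3e}")
```

Under these assumptions, the Muon expression grows with T whenever κ < 1, coincides with 1/N at κ = 1, and shrinks with T when κ > 1, whereas the MiMuon expression does not depend on T or κ.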

Abstract

Matrix-structured parameters appear in many artificial intelligence models, such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we therefore study the generalization error of the Muon optimizer using algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\left(\frac{1}{N\kappa^{T}}\right)$, where $N$ is the training sample size, $T$ is the number of iterations, and $\kappa > 0$ is the minimum difference between the singular values of the gradient estimate. To enhance generalization of the…
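As background for what treating a parameter as a matrix means here, the sketch below shows the general shape of a Muon-style step: accumulate momentum, replace the momentum matrix with an approximately orthogonal matrix, and step along it. It is a simplified illustration under stated assumptions, not the paper's algorithm. It uses the classical cubic Newton-Schulz iteration rather than the tuned polynomial used in common Muon implementations, the names `orthogonalize` and `muon_step` and the defaults `lr` and `beta` are hypothetical, and MiMuon's rule for mixing Muon with momentum-based SGD is not described in this excerpt, so it is not sketched.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalize(m: Matrix, steps: int = 30) -> Matrix:
    """Approximate the orthogonal polar factor of m with cubic Newton-Schulz.

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    where the map s -> 1.5*s - 0.5*s**3 drives each one toward 1.
    """
    norm = math.sqrt(sum(v * v for row in m for v in row))
    if norm == 0.0:
        raise ValueError("cannot orthogonalize the zero matrix")
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rq)]
             for rx, rq in zip(x, xxtx)]
    return x


def muon_step(weight: Matrix, momentum: Matrix, grad: Matrix,
              lr: float = 0.02, beta: float = 0.95) -> Tuple[Matrix, Matrix]:
    """One simplified Muon-style update; lr and beta are illustrative values."""
    new_momentum = [[beta * mv + gv for mv, gv in zip(rm, rg)]
                    for rm, rg in zip(momentum, grad)]
    direction = orthogonalize(new_momentum)
    new_weight = [[wv - lr * dv for wv, dv in zip(rw, rd)]
                  for rw, rd in zip(weight, direction)]
    return new_weight, new_momentum


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    w, mom = muon_step(zeros, zeros, [[1.0, 2.0], [3.0, 4.0]])
    print(w)
```

Because the update direction is built from the singular structure of the gradient estimate rather than its raw entries, quantities such as the gaps between singular values, which enter the bound through $\kappa$, become relevant to the analysis.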
