AdaMuon: Adaptive Muon Optimizer
Chongjie Si, Debing Zhang, Wei Shen

TL;DR
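AdaMuon is a new optimizer that combines element-wise adaptivity with orthogonal updates, improving training efficiency and stability for large-scale neural networks. Its updates are rescaled to match Adam's root-mean-square magnitude, so existing learning rate schedules can be reused without extra tuning.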
Contribution
It applies an element-wise second-moment estimator to orthogonalized, sign-stabilized update directions, enabling variance-adaptive scaling while keeping the update geometry stable in large-scale neural network optimization.
Findings
Surpasses Adam by over 40% in training efficiency on large-scale tasks.
Maintains stability while improving convergence speed.
Allows reuse of existing learning rate schedules without tuning.
Abstract
We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second momentum estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, where the momentum is first sign-transformed before orthogonalization. These two components jointly enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy to match the root-mean-square update magnitude to Adam, allowing direct reuse of existing learning rate schedules without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can surpass Adam by more than 40% training efficiency in large-scale scenarios.
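The mechanisms named in the abstract can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration reconstructed from the abstract alone, not the authors' Algorithm 1. Several details are assumptions: the function names, the default hyperparameters, the use of one `beta` for both moving averages, the exact momentum form, and the classical cubic Newton-Schulz iteration, which stands in for whatever orthogonalization routine the paper actually uses. The RMS target of 0.2 is taken from the value quoted in the peer reviews.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
    """Approximately orthogonalize m with the classical cubic Newton-Schulz
    iteration (an assumed stand-in; the paper's variant may differ)."""
    # The Frobenius norm bounds the spectral norm, so after this division
    # every singular value is at most 1, inside the convergence region.
    x = m / (np.linalg.norm(m) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes singular values toward 1
    return x.T if transposed else x


def adamuon_step(w, grad, m, v, lr=1e-3, beta=0.95, eps=1e-8,
                 rms_target=0.2, ns_steps=5):
    """One hypothetical AdaMuon-style update for a 2-D weight matrix."""
    # Momentum accumulation (assumed form; the sign below ignores its scale).
    m = beta * m + (1.0 - beta) * grad
    # Sign-stabilized orthogonal update: sign-transform, then orthogonalize.
    o = newton_schulz_orthogonalize(np.sign(m), ns_steps)
    # Element-wise second-moment estimate of the orthogonalized direction.
    v = beta * v + (1.0 - beta) * o * o
    o_hat = o / (np.sqrt(v) + eps)
    # RMS-aligned rescaling: force the update's RMS to rms_target.
    rms = np.sqrt(np.mean(o_hat ** 2))
    update = rms_target * o_hat / (rms + eps)
    return w - lr * update, m, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 6))
    m, v = np.zeros_like(w), np.zeros_like(w)
    for _ in range(3):
        grad = rng.normal(size=w.shape)  # placeholder gradient
        w, m, v = adamuon_step(w, grad, m, v)
    print(w.shape)
```

Because the final rescaling fixes the update's RMS regardless of the raw scale of the preceding steps, the step size is governed by `lr` and `rms_target` alone, which is the property the abstract relies on for reusing Adam learning rate schedules.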
Peer Reviews
Decision·Submitted to ICLR 2026
1. It achieves better performance than Adam and Muon on GPT-2 and Qwen2.5, providing efficiency gains while maintaining convergence quality. 2. It demonstrates stable training behavior across different model scales.
1. The scalability of AdaMuon to even larger foundation models (e.g., 70B+) remains untested. 2. The experiments are insufficient, with missing hyperparameter analysis. 3. The effectiveness of accumulating the second-momentum term is not validated by comparing it with other options.
1. The work targets the development of more efficient and stable optimizers for training massive foundation models, which is a significant problem. 2. Combining Muon and Adam is an interesting research direction, and the experimental results are promising.
1. There is a disconnect between the stated motivation for the sign operation (stabilizing early-stage training) and the empirical results from the ablation study (Figure 3), which show the performance gap appearing in later stages. This questions the authors' understanding of why their own method works and weakens the overall narrative. 2. The paper presents AdaMuon as a principled combination, but the sign operation is a hard, non-linear transformation that fundamentally alters the information
1. Combining sign-stabilized orthogonal directions with element-wise EMA and RMS alignment is practical and simple; Alg. 1 is relatively easy to implement. 2. Thm. 1 provides a clean rationale for the sign transform due to the unique admissible element-wise map under natural constraints; the appendices provide a convergence discussion with assumptions stated explicitly. 3. On GPT-2 (4 scales) and Qwen2.5 (1.5B/7B) models, AdaMuon optimizer outperforms AdamW optimizer and slightly exceeds Muon
1. Comparisons are limited to the AdamW and Muon optimizers, despite many relevant contenders (for example, Adafactor, Shampoo, and Lion). The paper explicitly argues that these two baselines “suffice”. 2. There is little analysis of the choice of RMS target (0.2), ε, β, or the number of Newton–Schulz steps; the argument for omitting bias correction hinges on RMS alignment canceling multiplicative bias, but early-phase, non-constant bias or heavy-tailed coordinates might
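The bias-correction argument mentioned in the last review can be made concrete with a short derivation. This is a sketch under assumptions not stated on this page: the correction is a single scalar factor $c_t = 1/(1-\beta^t)$ applied to the second-moment estimate $V_t$, $\epsilon$ is negligible, and RMS alignment multiplies the update by $\gamma / \mathrm{RMS}(\cdot)$, with $\gamma$ the RMS target. The symbols $O_t$, $V_t$, $\hat{O}_t$, $c_t$, and $\gamma$ are notation introduced here for illustration.

```latex
\hat{O}_t = \frac{O_t}{\sqrt{V_t}}, \qquad
\hat{O}_t^{\mathrm{bc}} = \frac{O_t}{\sqrt{c_t V_t}} = c_t^{-1/2}\,\hat{O}_t

\frac{\gamma\,\hat{O}_t^{\mathrm{bc}}}{\mathrm{RMS}\bigl(\hat{O}_t^{\mathrm{bc}}\bigr)}
  = \frac{\gamma\,c_t^{-1/2}\,\hat{O}_t}{c_t^{-1/2}\,\mathrm{RMS}\bigl(\hat{O}_t\bigr)}
  = \frac{\gamma\,\hat{O}_t}{\mathrm{RMS}\bigl(\hat{O}_t\bigr)}
```

Under these assumptions the scalar factor cancels exactly. The cancellation is only approximate when $\epsilon$ is not negligible relative to $\sqrt{V_t}$, and a per-coordinate or otherwise non-scalar bias would not cancel this way.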
Taxonomy
Topics: Particle Detector Development and Performance · Muon and positron interactions and applications · Particle physics theoretical and experimental studies
