FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, Jin Liu

TL;DR
FedMuon introduces a matrix orthogonalization-based optimizer for federated learning, significantly accelerating convergence and reducing communication rounds, especially in heterogeneous non-IID settings, by addressing client drift and aligning local-global updates.
Contribution
The paper proposes FedMuon, a novel federated optimizer with matrix orthogonalization, momentum aggregation, and local-global alignment, to improve convergence and reduce communication in non-IID federated learning.
Findings
FedMuon accelerates convergence in IID settings.
FedMuon reduces communication rounds compared to baselines.
FedMuon improves test accuracy in language and vision models.
Abstract
The core bottleneck of Federated Learning (FL) lies in the communication rounds, so making local updates more effective is crucial for reducing them. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often amplifies pathological directions in the weights during local updates, which worsens the condition number and slows convergence. Therefore, we introduce the Muon optimizer for local training, which uses matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in the IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in the non-IID setting, independent matrix orthogonalization based on the local distributions…
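To make "matrix orthogonalization" concrete, the sketch below replaces an update matrix with an approximately orthogonal matrix that shares its singular vectors, using the classical cubic Newton-Schulz iteration. This is an illustration under stated assumptions, not the paper's implementation: the function name `newton_schulz_orthogonalize`, the step count, and the choice of the cubic iteration (rather than whatever tuned coefficients Muon or FedMuon actually use) are all hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of G (hypothetical helper).

    Uses the classical cubic Newton-Schulz iteration
        X <- 1.5 * X - 0.5 * X @ X.T @ X,
    which drives every nonzero singular value of X toward 1.
    The coefficients are an assumption made for illustration only.
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which lies inside the convergence region of the iteration.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

# Toy "gradient" matrix standing in for a layer's weight update.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
O = newton_schulz_orthogonalize(G)

# Reference: the exact polar factor U @ Vt from a full SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt
```

The iteration needs only matrix multiplications, which is why it is attractive as a cheap stand-in for an explicit SVD. An element-wise optimizer such as SGD or Adam would instead rescale each entry of `G` independently, ignoring the singular-value structure that the orthogonalized update equalizes.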
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Adapting Muon to the FL context is new; however, the concepts of momentum aggregation and gradient alignment have been explored in prior literature. The empirical results show substantial improvements in convergence speed and communication efficiency, particularly under non-IID data conditions. The paper provides theoretical convergence guarantees for FedMuon, with linear speedup in the convergence rate under non-convex settings.
1. Although the paper mentions that additional hyperparameter configurations are provided in the appendix, there are none. The learning rate (LR) ranges are provided, but final values for these hyperparameters are not specified. The budget spent on hyperparameter tuning (e.g., number of search iterations or resource allocation) should also be included to provide transparency about how the hyperparameters were selected and to avoid any bias in the results.
2. While the paper compares FedMuon to v
- Simple algorithm combining orthogonalized steps with global direction alignment and momentum reuse, plus an SVD-based compression knob.
- Broad empirical sweep across CNN/Transformer/LLM settings.
- The experimental section omits several key federated optimizers that are directly relevant to the paper’s stated goals of mitigating client drift and improving communication efficiency. Missing baselines include MIME (NeurIPS ’21), FedAdam/Yogi/Adagrad from FedOpt (ICLR ’21), FedProx, FedNova, and FedDyn, all of which are standard drift reducers or adaptive aggregators. Their exclusion makes it difficult to attribute performance gains specifically to matrix orthogonalization rather than known
1. The integration of matrix orthogonalization into federated learning is novel, as it leverages the geometric structure of weights rather than treating them as independent parameters.
2. The proposed Local-Global Alignment and Momentum Aggregation mechanisms are elegant and directly address Muon’s weaknesses under non-IID conditions, particularly client drift.
3. The experimental coverage is broad, spanning both vision and NLP domains, which is uncommon for FL studies that often focus solely on
1. Although the paper claims only a “5% overhead” due to Muon’s Newton-Schulz iterations and SVD compression, no actual runtime, FLOP count, or memory usage benchmarks are provided.
2. While Table 7 reports a 1.05× communication cost for SVD-compressed momentum, it does not present total end-to-end training time, making it difficult to assess real-world efficiency.
3. There is no direct comparison between full Muon and truncated Muon in centralized training, so it remains unclear whether the gai
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Cryptography and Data Security
