SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
Hengrui Zhang, Boao Kong, Jiahe Geng, Zhengyang Huang

TL;DR
SUDA-Muon introduces a unified framework for fully decentralized Muon algorithms, enabling modular design and providing convergence guarantees, with empirical validation on CIFAR-100 and GPT-2 tasks.
Contribution
The paper proposes SUDA, a primal-dual communication template that makes the communication backbone of decentralized Muon a modular choice, and it establishes a convergence guarantee together with two non-modular design boundaries.
Findings
SUDA-Muon achieves a non-asymptotic convergence guarantee whose dominant term does not explicitly involve graph quantities.
Tracking-before-polarization is necessary to avoid non-stationary fixed points.
In non-IID settings, SUDA-Muon outperforms DeMuon in accuracy and loss.
Abstract
Fully decentralized Muon is difficult because its nonlinear matrix-sign operator does not commute with linear gossip averaging. This makes decentralized Muon a structural design problem: in designing the algorithm, one must distinguish modular components from non-modular ones. We propose SUDA-Muon, which realizes this separation through a unified primal-dual communication template called SUDA; within this template, ED/D, EXTRA, and gradient tracking become modular backbone choices. We prove a topology-separated non-asymptotic convergence guarantee in the nuclear-norm geometry: the dominant term does not explicitly involve graph quantities, identifying the communication backbone as the modular axis of the structural design. We then establish two complementary non-modular boundaries. Internally, tracking-before-polarization is…
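The non-commutation between the matrix-sign operator and averaging can be checked numerically. The following is a minimal sketch, not the paper's algorithm: the helper `msign`, the tolerance `tol`, and the three local gradient matrices are hypothetical choices made for illustration. The gradients are picked so that they sum to zero, which means the network-average gradient vanishes while the average of the locally polarized gradients does not.

```python
import numpy as np

def msign(G, tol=1e-12):
    """Matrix sign (polar factor) U V^T over the nonzero singular values of G.

    Illustrative helper; singular values below `tol` are treated as zero,
    so msign of the zero matrix is the zero matrix.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s > tol
    return U[:, keep] @ Vt[keep, :]

# Hypothetical local gradients on three nodes. They sum to zero,
# so the exact network average of the gradients is the zero matrix.
A = np.diag([1.0, 0.1])
B = np.diag([0.1, 1.0])
grads = [A, B, -(A + B)]

# Average first, then apply the matrix sign.
avg_then_sign = msign(sum(grads) / len(grads))

# Apply the matrix sign locally, then average.
sign_then_avg = sum(msign(G) for G in grads) / len(grads)

print("average then sign:\n", avg_then_sign)
print("sign then average:\n", sign_then_avg)
```

Here averaging first yields a zero update, whereas polarizing each local gradient first yields a nonzero averaged direction even though the averaged gradient is zero. This is only a toy illustration of why the order of averaging and polarization matters; it makes no claim about the specific update rule used in SUDA-Muon.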
