Optimizer-Induced Mode Connectivity: From AdamW to Muon
Fangzhao Zhang, Sungyoon Kim, Erica Zhang, Yiqi Jiang, Mert Pilanci

TL;DR
This paper studies how the choice of optimizer shapes mode connectivity in neural networks, combining theoretical analysis of two-layer ReLU networks with GPT-2 pretraining experiments.
Contribution
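Viewing mode connectivity through optimizer-induced implicit regularization, it shows that the solutions reached by a single optimizer form a connected set in sufficiently wide two-layer ReLU networks, and it characterizes how the regions induced by different optimizers interact.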
Findings
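Solutions from the same optimizer form a connected set in two-layer ReLU networks at sufficiently large width.
At large width, the regions induced by two different optimizers can be disjoint or overlapping depending on regularization; in a small-width example, AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier.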
Cross-optimizer paths in GPT-2 traverse smooth transitions, preserving spectral properties.
Abstract
Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths…
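To make the notion of a loss barrier concrete, here is a minimal sketch that measures the barrier along a straight line between two zero-loss solutions of a tiny two-layer ReLU network. It is an illustration only, not the paper's construction: the paper concerns connectivity within optimizer-constrained solution sets, where paths need not be linear, whereas this toy example merely interpolates linearly between two neuron-permuted copies of the same network. All names (`predict`, `mse`, `interpolate`, `linear_barrier`) and the data are hypothetical, and the barrier definition used (maximum excess of the path loss over the linear interpolation of the endpoint losses) is one common convention assumed here.

```python
def relu(u):
    return u if u > 0.0 else 0.0

def predict(params, x):
    # params: list of (w_j, a_j) pairs, one per hidden neuron;
    # the network computes f(x) = sum_j a_j * relu(w_j * x).
    return sum(a * relu(w * x) for w, a in params)

def mse(params, xs, ys):
    # Mean squared error over the dataset.
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def interpolate(p0, p1, t):
    # Straight-line path in parameter space: (1 - t) * p0 + t * p1.
    return [((1 - t) * w0 + t * w1, (1 - t) * a0 + t * a1)
            for (w0, a0), (w1, a1) in zip(p0, p1)]

def linear_barrier(p0, p1, xs, ys, steps=11):
    # Assumed definition: max over t of
    #   loss(path(t)) - [(1 - t) * loss(p0) + t * loss(p1)].
    ts = [i / (steps - 1) for i in range(steps)]
    losses = [mse(interpolate(p0, p1, t), xs, ys) for t in ts]
    baseline = [(1 - t) * losses[0] + t * losses[-1] for t in ts]
    return max(l - b for l, b in zip(losses, baseline)), losses

# Toy data with target y = x, which relu(x) - relu(-x) fits exactly.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = xs[:]

theta_a = [(1.0, 1.0), (-1.0, -1.0)]   # f(x) = relu(x) - relu(-x)
theta_b = [(-1.0, -1.0), (1.0, 1.0)]   # same function, neurons swapped

barrier, losses = linear_barrier(theta_a, theta_b, xs, ys)
print(barrier)  # prints 2.5: the midpoint is the all-zero network
```

Both endpoints have zero loss, yet the midpoint of the straight line collapses every weight to zero, so the linear path crosses a barrier equal to the mean of x². A nonzero linear barrier does not by itself show that two solutions are disconnected, since a curved path may still avoid it; that is why a provable barrier separating components, as stated in the abstract, is a stronger claim.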
