Optimizer-Induced Mode Connectivity: From AdamW to Muon

Fangzhao Zhang; Sungyoon Kim; Erica Zhang; Yiqi Jiang; Mert Pilanci

arXiv:2605.09991·cs.AI·May 12, 2026

TL;DR

This paper studies how the choice of optimizer shapes mode connectivity in neural networks. It combines theory for two-layer ReLU networks with GPT-2 pretraining experiments to show that the structure of connected solution regions depends on the optimizer.

Contribution

It shows that optimizer choice affects how solutions are connected and structured, by analyzing mode connectivity restricted to the solutions selected by a given optimizer's implicit regularization.

Findings

01. Solutions from the same optimizer form connected sets at large width.

02. Different optimizers can lead to disjoint or overlapping solution regions depending on regularization.

03. Cross-optimizer paths in GPT-2 traverse smooth transitions, preserving spectral properties.

Abstract

Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer (AdamW, Muon, or others in the Lion-$K$ family) form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width, two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe that same-optimizer paths preserve each model's spectrum while cross-optimizer paths…
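To make the notion of a loss barrier between two solutions concrete, here is a minimal sketch of a standard diagnostic: evaluate the loss along the straight line between two parameter sets and report how far it rises above the interpolated endpoint losses. It is an illustration under stated assumptions, not the paper's construction. The linear path, the MSE loss, the bias-free two-layer ReLU network, the random toy data, and every name (`two_layer_relu`, `loss_barrier`, `params_a`, `params_b`) are hypothetical choices; the paths studied in the paper may be nonlinear.

```python
import numpy as np

def two_layer_relu(params, X):
    """Forward pass f(x) = W2 relu(W1 x) for a bias-free two-layer network."""
    W1, W2 = params
    return np.maximum(X @ W1.T, 0.0) @ W2.T

def mse_loss(params, X, y):
    """Mean squared error of the network on (X, y)."""
    return float(np.mean((two_layer_relu(params, X) - y) ** 2))

def interpolate(params_a, params_b, t):
    """Point at fraction t on the straight line from params_a to params_b."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(params_a, params_b))

def loss_barrier(params_a, params_b, X, y, num_points=21):
    """Loss along the linear path, and its largest excess over the
    linear interpolation of the two endpoint losses."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [mse_loss(interpolate(params_a, params_b, t), X, y) for t in ts]
    )
    baseline = (1.0 - ts) * losses[0] + ts * losses[-1]
    barrier = float(np.max(losses - baseline))
    return ts, losses, barrier

# Toy usage: random data and two random parameter sets standing in for
# solutions reached by two different optimizers (assumption: not trained).
rng = np.random.default_rng(0)
n, d, m = 64, 5, 16  # samples, input dimension, hidden width
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))
params_a = (rng.normal(size=(m, d)), rng.normal(size=(1, m)))
params_b = (rng.normal(size=(m, d)), rng.normal(size=(1, m)))

ts, losses, barrier = loss_barrier(params_a, params_b, X, y)
print(f"barrier along the linear path: {barrier:.4f}")
```

A barrier near zero along some path suggests the two endpoints lie in one low-loss region, whereas a large barrier on the straight line alone does not rule out a curved connecting path. Disconnection claims like the small-width result in the abstract therefore need a proof rather than a single interpolation.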
