TL;DR
This paper investigates why fine-tuning Adam-pretrained models with Muon degrades performance. It provides evidence that the optimizer mismatch disrupts pretrained knowledge, and shows that LoRA narrows the resulting gap between Adam and Muon across language and vision tasks.
Contribution
It identifies optimizer mismatch (pretraining with Adam, then fine-tuning with Muon) as a key cause of degraded fine-tuning performance, and demonstrates that constraining updates with LoRA mitigates it.
Findings
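LoRA reduces the performance gap between Adam and Muon that appears under full fine-tuning, on both language and vision tasks.
The mismatch disrupts pretrained knowledge, and the disruption scales with update strength.
Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength.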
Abstract
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength.…
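The abstract's contrast between the two optimizers, and the idea of constraining updates with LoRA, can be illustrated with a small NumPy sketch. This is not the paper's code, and the paper's exact definition of "update strength" is not given in this summary. The sketch assumes only well-known facts: Muon applies an approximately orthogonalized update (computed here exactly via SVD rather than with Newton-Schulz iterations, and without momentum), Adam's first bias-corrected step is roughly the elementwise sign of the gradient, and a LoRA update is a product of two low-rank factors. The function names and the 16x16 shapes are hypothetical.

```python
import numpy as np


def orthogonalized_update(grad):
    """Exact polar factor U @ V^T of `grad`.

    Muon approximates this with Newton-Schulz iterations applied to a
    momentum buffer; SVD on the raw gradient is used here only for clarity.
    """
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt


def adam_first_step(grad, eps=1e-8):
    """Direction of Adam's first bias-corrected step: g / (|g| + eps)."""
    return grad / (np.abs(grad) + eps)


def lora_delta(rank, d_out, d_in, rng, scale=1.0):
    """Hypothetical LoRA-style weight update scale * B @ A, with rank <= `rank`."""
    B = rng.standard_normal((d_out, rank))
    A = rng.standard_normal((rank, d_in))
    return scale * (B @ A)


# Illustrative comparison on a random "gradient" (an assumption, not real data).
rng = np.random.default_rng(0)
G = rng.standard_normal((16, 16))

muon_sv = np.linalg.svd(orthogonalized_update(G), compute_uv=False)
adam_sv = np.linalg.svd(adam_first_step(G), compute_uv=False)
delta = lora_delta(rank=4, d_out=16, d_in=16, rng=rng)

print("Muon-style singular values (min, max):", muon_sv.min(), muon_sv.max())
print("Adam-style singular values (min, max):", adam_sv.min(), adam_sv.max())
print("Rank of LoRA-style update:", np.linalg.matrix_rank(delta))
```

The orthogonalized update pushes equally along every singular direction, while the sign-like Adam step has an uneven spectrum; a rank-4 LoRA update can touch at most four directions of the weight matrix. How these properties map onto the paper's measured mismatch is left to the paper itself.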
