Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Yuxing Liu; Jianyu Wang; Tong Zhang

arXiv:2605.06654 · cs.LG · May 8, 2026

TL;DR

Fully finetuning a large language model with the same optimizer that was used in pretraining forgets less while matching or improving performance on the new task, a phenomenon the authors call optimizer-model consistency.

Contribution

The paper introduces the concept of optimizer-model consistency and supports it with controlled experiments and theoretical analysis, showing how the choice of optimizer affects forgetting and new-task performance during supervised finetuning (SFT).

Findings

01. Full finetuning with the same optimizer as pretraining reduces forgetting.

02. Optimizers have regularization effects on the activations, which shape the landscape around the pretrained checkpoint and, in turn, knowledge retention.

03. The Muon optimizer tends to memorize, which hinders reasoning tasks.

Abstract

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of…
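
The "learning-forgetting tradeoff" in the abstract can be made concrete with a small comparison helper. The sketch below is only an illustration, not the authors' evaluation protocol: the class `FinetuneResult`, the function `has_better_tradeoff`, the choice of pretraining-data loss as the forgetting measure, and every number are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneResult:
    """One finetuning run (hypothetical record, not from the paper)."""
    name: str
    pretrain_loss_before: float  # loss on pretraining-style data before SFT
    pretrain_loss_after: float   # same loss measured after SFT
    new_task_score: float        # higher is better on the new task

    @property
    def forgetting(self) -> float:
        # Assumed forgetting measure: increase in pretraining-data loss.
        return self.pretrain_loss_after - self.pretrain_loss_before


def has_better_tradeoff(a: FinetuneResult, b: FinetuneResult) -> bool:
    """True if `a` forgets no more than `b`, scores no worse on the new
    task, and is strictly better on at least one of the two axes."""
    no_worse = (a.forgetting <= b.forgetting
                and a.new_task_score >= b.new_task_score)
    strictly_better = (a.forgetting < b.forgetting
                       or a.new_task_score > b.new_task_score)
    return no_worse and strictly_better


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the comparison.
    run_a = FinetuneResult("run_a", 2.0, 2.1, 0.60)
    run_b = FinetuneResult("run_b", 2.0, 2.5, 0.60)
    print(has_better_tradeoff(run_a, run_b))
```

Treating the comparison as a dominance check on two axes mirrors the abstract's wording ("forgetting less while achieving the same or better performance on the new task"). Other summaries of the tradeoff are equally possible.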
