MERGETUNE: Continued Fine-Tuning of Vision-Language Models
Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler

TL;DR
This paper introduces MERGETUNE, a post hoc continued fine-tuning method for vision-language models that merges zero-shot and fine-tuned solutions to recover pretrained knowledge and improve generalization.
Contribution
MERGETUNE is a novel, model-agnostic strategy guided by linear mode connectivity that enhances pretrained knowledge retention after fine-tuning without architectural changes.
Findings
Improves the base-novel generalization of CoOp by +5.6% in harmonic mean.
Surpasses ensemble baselines with lower inference cost.
Achieves state-of-the-art results on robust fine-tuning evaluations.
Abstract
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the…
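The notion of a "low-loss path" between two solutions can be illustrated with a small diagnostic. The sketch below is not the MERGETUNE training procedure; it only shows how linear mode connectivity between two parameter vectors is commonly checked, by evaluating the loss along the straight line between them and measuring the barrier. The function names (`interpolate`, `path_losses`, `loss_barrier`) and the toy losses are hypothetical and chosen for illustration only.

```python
import numpy as np


def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line from theta_a (alpha=0) to theta_b (alpha=1)."""
    return (1.0 - alpha) * theta_a + alpha * theta_b


def path_losses(loss_fn, theta_a, theta_b, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the linear path."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    )
    return alphas, losses


def loss_barrier(alphas, losses):
    """Largest excess of the path loss over the line joining the endpoint losses."""
    baseline = (1.0 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))


# Toy losses (assumptions for illustration, not losses from the paper).
def double_well(theta):
    # Two minima at theta = -1 and theta = +1, separated by a hump at 0.
    return float(np.sum((theta ** 2 - 1.0) ** 2))


def bowl(theta):
    # Convex loss: any two points are linearly connected without a barrier.
    return float(np.sum(theta ** 2))


theta_a = np.array([-1.0])  # stands in for one solution (e.g., zero-shot)
theta_b = np.array([1.0])   # stands in for another (e.g., fine-tuned)

alphas, losses = path_losses(double_well, theta_a, theta_b)
print("double-well barrier:", loss_barrier(alphas, losses))

alphas, losses = path_losses(bowl, theta_a, theta_b)
print("bowl barrier:", loss_barrier(alphas, losses))
```

In this toy setting the double-well loss has a clear barrier between its two minima, while the convex bowl has none. A continued model in the sense of the abstract would be one whose linear paths to both the zero-shot and the fine-tuned solutions show little or no barrier.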
Peer Reviews
Decision: ICLR 2026 Poster
- The proposed framework shows a strong performance gain over the simple model-merging baseline.
- It achieves competitive performance at low inference cost.
- The theoretical analysis of the low-loss path phenomenon and its relation to easing catastrophic forgetting is interesting.
- While the overall performance gain looks good, the improvement across the different tuning methods is unstable: it sometimes hurts base performance while largely improving novel-class performance, and sometimes degrades novel performance while base performance improves. This indicates instability in the proposed framework.
- Figure 1 is really unfriendly to the reader. I strongly suggest that the authors update the scale for different datasets to emphasize the performance difference between m
Strengths:
1. Introduces a novel post-hoc fine-tuning paradigm focusing on knowledge restoration instead of prevention, offering a new conceptual direction.
2. The approach is simple, elegant, and general, requiring no model modifications and applying to various adaptation methods.
3. Using linear mode connectivity as an explicit optimization objective provides solid geometric intuition and interpretability.
4. Experimental evaluation is extensive and convincing, demonstrating consistent gai
1. The theoretical justification is limited, and it is unclear why continued fine-tuning converges to a mode-connected region.
2. The final objective closely resembles standard $L_2$-regularized fine-tuning, so the novelty might be overstated.
3. The isotropic Hessian assumption in the surrogate loss is strong and unvalidated; its practical effect remains unclear.
4. The paper does not quantify the extra training cost or test scalability on larger or multimodal models.
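The link the reviewers draw between the isotropic Hessian assumption and $L_2$ regularization can be made explicit with a standard second-order expansion. The derivation below is a sketch under stated assumptions, not the paper's exact surrogate: $\mathcal{L}_{\mathrm{pre}}$ (the pretraining loss), $\theta_0$ (the zero-shot checkpoint), and $g$ are notation introduced here for illustration, while $H_1 \approx \mu I$ and a near-zero gradient are the assumptions the reviews attribute to the paper.

```latex
\[
\mathcal{L}_{\mathrm{pre}}(\theta)
\approx \mathcal{L}_{\mathrm{pre}}(\theta_0)
+ g^{\top} (\theta - \theta_0)
+ \tfrac{1}{2} (\theta - \theta_0)^{\top} H_1 (\theta - \theta_0),
\qquad g = \nabla \mathcal{L}_{\mathrm{pre}}(\theta_0).
\]

% Assuming g \approx 0 and H_1 \approx \mu I:
\[
\mathcal{L}_{\mathrm{pre}}(\theta)
\approx \mathcal{L}_{\mathrm{pre}}(\theta_0)
+ \frac{\mu}{2} \, \lVert \theta - \theta_0 \rVert_2^2 .
\]
```

Under these two assumptions the pretraining loss near $\theta_0$ reduces to a constant plus a squared distance to the zero-shot weights, which needs no pretraining data and has the same form as an $L_2$ penalty anchored at $\theta_0$.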
(1) Introduces a conceptually fresh and practical paradigm, continued fine-tuning, that decouples adaptation from knowledge recovery, enabling post hoc enhancement of any existing fine-tuned VLM.
(2) Proposes a theoretically grounded yet simple method (MERGETUNE) based on linear mode connectivity, with a clever second-order surrogate that eliminates the need for pretraining data replay, a major practical bottleneck.
(3) Demonstrates consistent and significant improvements across diverse adaptat
(1) The surrogate loss assumes the Hessian of the pretraining loss at the zero-shot checkpoint is isotropic (H₁ ≈ μI) and that the gradient is near zero. While common, this may not hold for large-scale VLMs like CLIP trained on noisy web data. Could the authors provide empirical validation of these assumptions (e.g., via Hessian spectrum estimation on a subset) or discuss how violations might affect MERGETUNE’s performance?
(2) In Table 1, MERGETUNE improves CoOp by +5.58% HM but only +0.36% o
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
