Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

Mingyuan Zhang; Yue Bai; Yifan Wang; Yiyang Huang; Yun Fu

arXiv:2512.23073·cs.LG·December 30, 2025

Open Access

TL;DR

This paper applies Mask Fine-Tuning (MFT) to vision-language models. Instead of updating weights, MFT reorganizes internal subnetworks via learnable gating, and the authors report that it outperforms weight-update methods such as LoRA and full fine-tuning.

Contribution

The paper introduces MFT for VLMs and demonstrates that reconfiguring internal structures can surpass weight-update methods without changing the pre-trained backbone.

Findings

01. MFT outperforms LoRA variants in the reported experiments.

02. MFT surpasses full fine-tuning in performance.

03. Internal reorganization alone is enough for effective adaptation.

Abstract

Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT…
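
The gating idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract only says that each weight receives a learnable gating score, so the hard threshold at zero, the straight-through gradient, the toy data, and the names `hard_mask`, `masked_forward`, and `train_scores` are all assumptions made for illustration.

```python
def hard_mask(scores):
    """Binary gate per weight: 1.0 where the score is positive, else 0.0.
    (Assumed gating rule; the abstract does not specify one.)"""
    return [[1.0 if s > 0 else 0.0 for s in row] for row in scores]


def masked_forward(W, scores, x):
    """Compute y = (W * mask) @ x without ever modifying W."""
    mask = hard_mask(scores)
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(W, mask)
    ]


def train_scores(W, scores, data, lr=0.1, epochs=20):
    """Fit the gating scores with SGD on squared error; W stays frozen.

    Assumption: straight-through estimator, i.e. d(mask)/d(score) is
    treated as 1 so the gradient w.r.t. the mask flows to the score.
    """
    for _ in range(epochs):
        for x, target in data:
            y = masked_forward(W, scores, x)
            for i, (yi, ti) in enumerate(zip(y, target)):
                err = yi - ti  # dL/dy_i for L = 0.5 * sum(err ** 2)
                for j, xj in enumerate(x):
                    # dL/d(mask_ij) = err * W_ij * x_j, applied to the score
                    scores[i][j] -= lr * err * W[i][j] * xj
    return scores


# Frozen "pre-trained" weights; the toy task only needs the first and third.
W = [[1.0, 2.0, -3.0]]
scores = [[0.5, 0.5, 0.5]]  # all gates start open
data = [
    ([1.0, 0.0, 0.0], [1.0]),
    ([0.0, 1.0, 0.0], [0.0]),
    ([0.0, 0.0, 1.0], [-3.0]),
    ([1.0, 1.0, 1.0], [-2.0]),
]
train_scores(W, scores, data)
print(hard_mask(scores))
```

On this toy task the learned mask closes the gate on the second weight and keeps the other two open, while `W` itself is never changed. Only the scores are trained, so adaptation comes from selecting a subnetwork of the frozen weights.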

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis