Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

Qian-Wei Wang; Guanghao Meng; Ren Cai; Yaguang Song; Shu-Tao Xia

arXiv:2602.04337·cs.CV·February 5, 2026

TL;DR

This paper introduces CoFT, an unsupervised framework for adapting pre-trained vision-language models to downstream tasks without labeled data. It combines dual-model, cross-modal collaboration with a dual-prompt (positive and negative) learning strategy to improve robustness and performance.

Contribution

The paper presents CoFT, a novel unsupervised adaptation method that combines dual-model collaboration with prompt-based, sample-dependent modeling of pseudo-label cleanliness, removing the need for hand-crafted confidence thresholds.

Findings

01

CoFT outperforms existing unsupervised methods in various tasks.

02

CoFT+ further improves performance through iterative and contrastive learning.

03

The approach achieves results comparable to few-shot supervised baselines.

Abstract

Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT…
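To make the dual-prompt idea concrete, the following is a minimal sketch of how a positive and a negative text prompt could yield a per-sample cleanliness score for a pseudo-label, with no fixed confidence threshold. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pseudo_label`, `clean_probability`), the cosine-similarity scoring, the two-way softmax, and the temperature value are all hypothetical, and the features are toy vectors standing in for CLIP image and text embeddings.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_label(image_feat, class_text_feats, temperature=0.01):
    """Zero-shot pseudo-label: the class whose text feature is most similar.

    Returns (label_index, confidence). The temperature is an assumed value.
    """
    sims = [cosine(image_feat, t) for t in class_text_feats]
    probs = softmax(sims, temperature)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


def clean_probability(image_feat, pos_prompt_feat, neg_prompt_feat,
                      temperature=0.01):
    """Assumed cleanliness score for a pseudo-labeled sample.

    Compares the image against a positive prompt ("this is class c") and a
    negative prompt ("this is not class c") for its pseudo-label, and returns
    the softmax weight on the positive side. No global threshold is involved.
    """
    pos = cosine(image_feat, pos_prompt_feat)
    neg = cosine(image_feat, neg_prompt_feat)
    return softmax([pos, neg], temperature)[0]


# Toy example: 2-D stand-ins for image and text embeddings.
image = [1.0, 0.0]
class_feats = [[1.0, 0.0], [0.0, 1.0]]
label, confidence = pseudo_label(image, class_feats)
weight = clean_probability(image, pos_prompt_feat=[1.0, 0.0],
                           neg_prompt_feat=[0.0, 1.0])
```

In a sketch like this, `weight` could serve as a per-sample loss weight during self-training, so that low-confidence samples are down-weighted rather than discarded by a hard cutoff. How CoFT actually learns and uses its prompts is not specified in the text available here.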

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications