Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner
Qian-Wei Wang, Guanghao Meng, Ren Cai, Yaguang Song, Shu-Tao Xia

TL;DR
This paper introduces CoFT, an unsupervised framework for adapting large vision-language models to downstream tasks without labeled data, using dual-model collaboration and prompt strategies to improve robustness and performance.
Contribution
The paper presents CoFT, an unsupervised adaptation method that combines dual-model collaboration with dual-prompt pseudo-label filtering, removing the need for hand-crafted confidence thresholds.
Findings
CoFT outperforms existing unsupervised methods in various tasks.
CoFT+ further improves performance through iterative and contrastive learning.
The approach achieves results comparable to few-shot supervised baselines.
Abstract
Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT…
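As a rough illustration of the dual-prompt idea described in the abstract, the sketch below scores how "clean" a pseudo-label looks by comparing an image feature against a positive and a negative text prompt for the predicted class. This is not the authors' implementation: the two-way softmax, the temperature value, and the names `cosine` and `pseudo_label_cleanliness` are assumptions made for this example, and the toy 2-D vectors stand in for CLIP image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_label_cleanliness(image_feat, pos_prompts, neg_prompts, temperature=0.07):
    """Return (pseudo_label, p_clean) for one image.

    pos_prompts[c] / neg_prompts[c] are hypothetical text features for the
    positive and negative prompt of class c. The scoring rule below is an
    assumed stand-in, not the formula from the paper.
    """
    # Pseudo-label: class whose positive prompt is most similar to the image.
    pos_sims = [cosine(image_feat, p) for p in pos_prompts]
    label = max(range(len(pos_sims)), key=pos_sims.__getitem__)

    # Sample-dependent cleanliness: two-way softmax between the positive and
    # negative prompt of the predicted class, so no global threshold is needed.
    z_pos = pos_sims[label] / temperature
    z_neg = cosine(image_feat, neg_prompts[label]) / temperature
    m = max(z_pos, z_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return label, e_pos / (e_pos + e_neg)


# Toy usage with made-up 2-D features.
pos = [[1.0, 0.0], [0.0, 1.0]]
neg = [[0.0, 1.0], [1.0, 0.0]]
confident = pseudo_label_cleanliness([1.0, 0.0], pos, neg)
ambiguous = pseudo_label_cleanliness([1.0, 1.0], pos, neg)
print(confident, ambiguous)
```

In this toy setup, an image aligned with one class's positive prompt gets a cleanliness score near 1, while an image equally similar to the positive and negative prompt gets 0.5. A score like this could be used to weight each sample rather than to discard it with a fixed cutoff.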
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
