Unsupervised Prototype Adapter for Vision-Language Models
Yi Zhang, Ce Zhang, Xueting Hu, Zhihai He

TL;DR
This paper introduces an unsupervised fine-tuning method called Unsupervised Prototype Adapter (UP-Adapter) for vision-language models like CLIP, enabling effective downstream recognition without annotated data.
Contribution
The paper proposes a novel unsupervised approach to adapt vision-language models using automatically selected samples and class prototypes, eliminating the need for labeled data.
Findings
Outperforms 8-shot CoOp and Tip-Adapter on image recognition tasks, despite using no labeled samples.
Achieves superior results on domain generalization benchmarks.
Demonstrates effectiveness without requiring annotated datasets.
Abstract
Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident…
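The selection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes image and text features have already been extracted by a CLIP-like encoder, and that "most confident" means the highest zero-shot softmax probability per pseudo-labeled class. The function names, the `k` and `temperature` parameters, the averaging of selected features into a class prototype, and the text-feature fallback for empty classes are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_confident_and_build_prototypes(image_feats, text_feats, k=2, temperature=0.01):
    """Pseudo-label unlabeled images and average the top-k confident ones per class.

    image_feats: (num_images, dim) array of precomputed image features.
    text_feats:  (num_classes, dim) array of precomputed class-prompt features.
    k, temperature: hypothetical hyperparameters, not taken from the paper.
    Returns (prototypes, selected) where prototypes has shape (num_classes, dim)
    and selected maps each class index to the chosen image indices.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)

    # Zero-shot class probabilities from cosine similarity (numerically stable softmax).
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    pseudo_labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)

    num_classes, dim = txt.shape
    prototypes = np.zeros((num_classes, dim))
    selected = {}
    for c in range(num_classes):
        idx = np.flatnonzero(pseudo_labels == c)
        if idx.size == 0:
            # Assumption: fall back to the text feature when no image is assigned.
            prototypes[c] = txt[c]
            selected[c] = idx
            continue
        # Keep the k images with the highest confidence for this class.
        top = idx[np.argsort(-confidence[idx])[:k]]
        selected[c] = top
        prototypes[c] = img[top].mean(axis=0)

    return l2_normalize(prototypes), selected
```

Calling `select_confident_and_build_prototypes(image_feats, text_feats, k=16)` on features from an unlabeled target set would return one unit-norm prototype per class, which an adapter could then use as its initial class representations.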
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Context Optimization · Contrastive Language-Image Pre-training · Residual Connection · Adapter
