Unsupervised Prompt Learning for Vision-Language Models
Tony Huang, Jack Chu, Fangyun Wei

TL;DR
This paper introduces an unsupervised prompt learning method for vision-language models like CLIP, eliminating the need for labeled data and prompt engineering, and achieving superior transfer performance across multiple datasets.
Contribution
To the authors' knowledge, it is the first work to incorporate unsupervised learning into prompt learning, improving CLIP's transfer performance without requiring labeled target data.
Findings
Outperforms original CLIP with prompt engineering on ImageNet and 10 other datasets.
Competitive with the 8-shot CoOp and Tip-Adapter methods.
Demonstrates the effectiveness of unsupervised prompt learning in vision-language models.
Abstract
Contrastive vision-language models like CLIP have shown great progress in transfer learning. In the inference stage, the proper text description, also known as prompt, needs to be carefully designed to correctly classify the given images. In order to avoid laborious prompt engineering, recent works such as CoOp, CLIP-Adapter and Tip-Adapter propose to adapt vision-language models for downstream image recognition tasks on a small set of labeled data. Though promising improvements are achieved, requiring labeled data from the target datasets may restrict the scalability. In this paper, we explore a different scenario, in which the labels of the target datasets are unprovided, and we present an unsupervised prompt learning (UPL) approach to avoid prompt engineering while simultaneously improving transfer performance of CLIP-like vision-language models. As far as we know, UPL is the first…
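The abstract is truncated before the method details. The sketch below illustrates one common way to obtain training signal without target labels. It uses zero-shot CLIP scores to pseudo-label unlabeled images and keeps only the most confident images per class. Treat it as an assumption about how such a step could look, not as the authors' exact procedure. Plain Python lists stand in for precomputed CLIP image and text embeddings. The names `zero_shot_probs`, `select_top_k_per_class`, `temperature`, and `k` are hypothetical. The learnable prompt that would then be optimized on the selected (image, pseudo-label) pairs is not shown.

```python
import math
from collections import defaultdict


def normalize(vec):
    """L2-normalize a feature vector so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_feats, text_feats, temperature=0.01):
    """CLIP-style zero-shot scoring (sketch): softmax over cosine similarity / temperature.

    temperature=0.01 corresponds to a logit scale of 100; it is an assumed default.
    """
    texts = [normalize(t) for t in text_feats]
    probs = []
    for img in image_feats:
        img = normalize(img)
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in texts]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def select_top_k_per_class(probs, k):
    """Pseudo-label each image with its argmax class, then keep the k most
    confident images per class (a class-balanced selection, assumed here).

    Returns a sorted list of (image_index, pseudo_label) pairs.
    """
    by_class = defaultdict(list)
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)
        by_class[label].append((p[label], idx))
    selected = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        selected.extend((idx, label) for _, idx in items[:k])
    return sorted(selected)
```

For example, take two class embeddings `[[1.0, 0.0], [0.0, 1.0]]` and five image features `[[0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]`. Calling `select_top_k_per_class(zero_shot_probs(images, classes, temperature=0.1), 1)` returns `[(2, 1), (4, 0)]`, which keeps the single most confident image for each class. Selecting per class rather than globally keeps easy classes from dominating the pseudo-labeled set. That is also why `k` is a per-class budget.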
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Adapter · Context Optimization · Contrastive Language-Image Pre-training
