Conditional Prompt Learning for Vision-Language Models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

TL;DR
This paper introduces CoCoOp, a dynamic prompt learning method for vision-language models that generates input-conditional prompts. Compared with earlier static prompt methods such as CoOp, which overfit to the classes seen during training, it generalizes better to unseen classes and domains.
Contribution
The paper proposes Conditional Context Optimization (CoCoOp), a novel dynamic prompt learning approach that enhances generalization and transferability of vision-language models to unseen classes and domains.
Findings
CoCoOp outperforms CoOp on unseen classes within the same dataset.
CoCoOp transfers better than CoOp across different datasets.
CoCoOp also improves domain generalization performance relative to CoOp.
Abstract
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a…
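The difference between a static, CoOp-style prompt and an input-conditional one can be illustrated with a small NumPy sketch. The abstract above is cut off before it describes how the conditioning is learned, so the mechanism here is an assumption for illustration: a small two-layer network maps an image feature to a single offset that is added to every context vector. All sizes, weights, and function names are hypothetical, and no CLIP encoders or training loop are included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
EMBED_DIM = 8   # width of a token embedding
FEAT_DIM = 8    # width of an image feature
N_CTX = 4       # number of learnable context vectors
HIDDEN = 16     # hidden width of the assumed conditioning network

# Static, CoOp-style context: one set of learnable vectors shared by all inputs.
ctx = rng.normal(scale=0.02, size=(N_CTX, EMBED_DIM))

# Assumed conditioning network: a two-layer MLP from image feature to offset.
W1 = rng.normal(scale=0.1, size=(FEAT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, EMBED_DIM))
b2 = np.zeros(EMBED_DIM)


def static_prompt(class_embedding):
    """Prepend the shared context vectors to a class-name embedding."""
    return np.vstack([ctx, class_embedding[None, :]])


def conditional_prompt(image_feature, class_embedding):
    """Shift every context vector by an image-dependent offset, then prepend."""
    hidden = np.maximum(image_feature @ W1 + b1, 0.0)  # ReLU
    offset = hidden @ W2 + b2                          # shape: (EMBED_DIM,)
    # Broadcasting adds the same offset to each of the N_CTX context vectors.
    return np.vstack([ctx + offset, class_embedding[None, :]])


class_emb = rng.normal(size=EMBED_DIM)   # stand-in for a class-name token
img_a = rng.normal(size=FEAT_DIM)        # stand-ins for two image features
img_b = rng.normal(size=FEAT_DIM)

p_static = static_prompt(class_emb)
p_a = conditional_prompt(img_a, class_emb)
p_b = conditional_prompt(img_b, class_emb)

print("static prompt shape:", p_static.shape)
print("conditional prompt shape:", p_a.shape)
print("largest prompt difference between the two images:", np.abs(p_a - p_b).max())
```

In the static case the prompt depends only on the class, so the same context is reused for every image. In the conditional case the context rows change with the image feature while the class-name row stays fixed, which is the property the TL;DR refers to as input-conditional prompts.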
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Context Optimization · Balanced Selection · Contrastive Language-Image Pre-training
