TL;DR
This paper introduces CoOp, a simple learnable prompt method for vision-language models like CLIP, significantly reducing prompt engineering effort and improving downstream image recognition performance with minimal training data.
Contribution
Proposes Context Optimization (CoOp), which models a prompt's context words as learnable vectors and adapts pre-trained vision-language models to downstream image recognition while keeping all pre-trained parameters fixed.
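The idea can be sketched on a toy problem. What follows is a minimal illustration, not the authors' implementation: the mean-pooling `text_encoder`, the two 4-dimensional class embeddings, the two "image features", and the finite-difference gradient are all assumptions standing in for CLIP's frozen encoders and backpropagation. What it is meant to show is the structure: each class prompt is `[V_1, ..., V_M, CLASS]`, the context vectors `V` are shared across classes, and they are the only values that get updated.

```python
import math
import random

random.seed(0)
DIM, N_CTX = 4, 2  # toy embedding size and number of context vectors (assumed)

# Frozen, hypothetical class-name embeddings (stand-ins for CLIP word embeddings).
CLASS_EMB = {"cat": [1.0, 0.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0, 0.0]}

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def text_encoder(tokens):
    """Frozen toy encoder (assumption): mean-pool token vectors, then normalize."""
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(DIM)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def class_probs(ctx, image_feat, temperature=0.1):
    """Softmax over cosine similarities between an image and each class prompt."""
    logits = []
    for name in CLASS_EMB:
        prompt = ctx + [CLASS_EMB[name]]  # [V_1, ..., V_M, CLASS]
        w = text_encoder(prompt)
        logits.append(sum(a * b for a, b in zip(w, image_feat)) / temperature)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def loss(ctx, data):
    """Mean cross-entropy over (image_feature, label_index) pairs."""
    return -sum(math.log(class_probs(ctx, x)[y]) for x, y in data) / len(data)

def numeric_grad(ctx, data, eps=1e-5):
    """Central-difference gradient with respect to the context vectors only."""
    grad = [[0.0] * DIM for _ in ctx]
    for i in range(len(ctx)):
        for j in range(DIM):
            old = ctx[i][j]
            ctx[i][j] = old + eps
            up = loss(ctx, data)
            ctx[i][j] = old - eps
            down = loss(ctx, data)
            ctx[i][j] = old  # restore the original value
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def train_context(ctx, data, steps=50, lr=0.5):
    """Gradient descent on ctx; a step is kept only if it lowers the loss."""
    current = loss(ctx, data)
    for _ in range(steps):
        grad = numeric_grad(ctx, data)
        step = lr
        while step > 1e-6:
            trial = [[c - step * g for c, g in zip(cv, gv)]
                     for cv, gv in zip(ctx, grad)]
            trial_loss = loss(trial, data)
            if trial_loss < current:
                ctx, current = trial, trial_loss
                break
            step /= 2  # halve the step until the loss improves
    return ctx, current

# Two-shot toy dataset: one unit-norm "image feature" per class (made up).
data = [(unit([0.2, 0.1, 1.0, 0.0]), 0), (unit([0.1, 0.2, -1.0, 0.0]), 1)]
ctx0 = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(N_CTX)]
initial_loss = loss(ctx0, data)
ctx, final_loss = train_context(ctx0, data)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Note that `CLASS_EMB` and `text_encoder` are never modified; only `ctx` changes, which mirrors the frozen-backbone setup described above. A real setup would replace the finite-difference gradient with automatic differentiation through the frozen text encoder.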
Findings
CoOp outperforms hand-crafted prompts with as few as one or two shots.
With 16 shots, CoOp's average gain over hand-crafted prompts is around 15%.
Despite being a learning-based approach, CoOp shows strong domain generalization compared with the zero-shot model using hand-crafted prompts.
Abstract
Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming -- one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural…
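For reference, the zero-shot prediction that the abstract describes as synthesizing classification weights from text can be written as a softmax over cosine similarities. The symbols here are notation chosen for this note: f is the image feature, w_i is the text-encoder output for the prompt describing class i, K is the number of classes, and τ is a temperature.

```latex
p(y = i \mid x) =
  \frac{\exp\!\big(\cos(w_i, f) / \tau\big)}
       {\sum_{j=1}^{K} \exp\!\big(\cos(w_j, f) / \tau\big)}
```

With hand-crafted prompts, each w_i comes from a fixed template such as "a photo of a [CLASS]", which is why small changes in wording shift the classifier weights directly.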
Taxonomy
Methods: Contrastive Language-Image Pre-training · Context Optimization
