What does CLIP know about a red circle? Visual prompt engineering for VLMs
Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi

TL;DR
This paper investigates visual prompt engineering for CLIP, revealing that simple image edits like drawing shapes can enhance its ability to perform complex vision tasks such as referring expression comprehension and keypoint localization.
Contribution
The study introduces a novel visual prompting method for CLIP, demonstrating its emergent capabilities in discriminative tasks beyond classification through simple image modifications.
Findings
Achieved state-of-the-art performance in zero-shot referring expression comprehension.
Demonstrated strong performance in keypoint localization.
Revealed CLIP's emergent ability to focus on regions via visual prompts.
Abstract
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model's attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks. Finally, we draw attention…
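The red-circle idea from the abstract can be illustrated with a minimal sketch: draw a red outline around each candidate region, score every edited image against the text, and keep the best-scoring region. This is not the authors' implementation. The image is modelled as a plain nested list of RGB tuples, and `draw_red_circle`, `pick_region`, and `score_fn` are hypothetical names introduced here; `score_fn` stands in for an image-text similarity model such as CLIP, which is not loaded in this example.

```python
import math

RED = (255, 0, 0)


def draw_red_circle(image, box, thickness=2):
    """Return a copy of `image` with a red ellipse outline around `box`.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    `box` is (x0, y0, x1, y1) in pixel coordinates.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx = max((x1 - x0) / 2, 1)
    ry = max((y1 - y0) / 2, 1)
    out = [list(row) for row in image]  # copy so the input stays untouched
    for y in range(len(out)):
        for x in range(len(out[0])):
            # Normalised radial distance: 1.0 lies exactly on the ellipse.
            d = math.hypot((x - cx) / rx, (y - cy) / ry)
            if abs(d - 1.0) * min(rx, ry) <= thickness / 2:
                out[y][x] = RED
    return out


def pick_region(image, boxes, text, score_fn):
    """Score one circled copy of the image per box; return (best index, scores).

    `score_fn(image, text)` is an assumed callable returning a similarity
    score, e.g. a wrapper around a vision-language model.
    """
    scores = [score_fn(draw_red_circle(image, box), text) for box in boxes]
    best = max(range(len(boxes)), key=scores.__getitem__)
    return best, scores
```

Only the outline is drawn, so the rest of the image stays visible to the scorer, which matches the abstract's point that the circle directs attention while global information is kept. With a real model, `score_fn` would compute the similarity between the edited image and the referring expression.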
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Dropout · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization
