What does CLIP know about a red circle? Visual prompt engineering for VLMs

Aleksandar Shtedritski; Christian Rupprecht; Andrea Vedaldi

arXiv:2304.06712 · cs.CV · August 21, 2023 · 1 citation

TL;DR

This paper investigates visual prompt engineering for CLIP, showing that simple image edits, such as drawing a red circle around an object, direct the model's attention to that region and let it handle vision tasks beyond classification, such as referring expression comprehension and keypoint localization.

Contribution

The study introduces a novel visual prompting method for CLIP, demonstrating its emergent capabilities in discriminative tasks beyond classification through simple image modifications.

Findings

1. Achieved state-of-the-art zero-shot referring expression comprehension.
2. Demonstrated strong performance in keypoint localization.
3. Revealed CLIP's emergent ability to focus on regions via visual prompts.

Abstract

Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model's attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks. Finally, we draw attention…
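The image edit described in the abstract, marking a region with a red circle, can be illustrated with a short sketch. This is a minimal pure-Python illustration, not the authors' implementation: the function name `draw_red_ellipse`, the list-of-rows pixel representation, and the `thickness` default are all assumptions made for the example. The abstract does not specify line width or exact shape parameters. The marked image would then be passed to CLIP together with the text query; that step is not shown here.

```python
def draw_red_ellipse(pixels, box, thickness=2, color=(255, 0, 0)):
    """Return a copy of `pixels` with an ellipse outline drawn inside `box`.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    box: (x0, y0, x1, y1) bounding box in pixel coordinates.
    thickness: outline width in pixels (hypothetical default, not from the paper).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2   # ellipse centre
    rx, ry = (x1 - x0) / 2, (y1 - y0) / 2   # outer radii
    if rx <= thickness or ry <= thickness:
        raise ValueError("box is too small for the requested thickness")

    # Copy each row so the input image is left untouched.
    out = [list(row) for row in pixels]
    for y, row in enumerate(out):
        for x in range(len(row)):
            dx, dy = x - cx, y - cy
            # A pixel is on the outline if it lies inside the outer ellipse
            # but not strictly inside the inner ellipse.
            outer = (dx / rx) ** 2 + (dy / ry) ** 2
            inner = (dx / (rx - thickness)) ** 2 + (dy / (ry - thickness)) ** 2
            if outer <= 1.0 and inner >= 1.0:
                row[x] = color
    return out


# Usage: mark a region of a blank 20x20 white image.
white = (255, 255, 255)
image = [[white] * 20 for _ in range(20)]
marked = draw_red_ellipse(image, box=(4, 4, 16, 16))
```

Only the outline is recoloured, so the pixels inside and outside the circle are preserved. This matches the abstract's point that the edit highlights a region while the rest of the image, and hence global context, stays visible to the model.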

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Dropout · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization