Learning Visual Prompts for Guiding the Attention of Vision Transformers
Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar

TL;DR
This paper introduces a method to learn visual prompts that guide pre-trained vision transformers' attention to specific image regions without fine-tuning, improving task-specific focus in a self-supervised manner.
Contribution
It proposes a novel self-supervised approach to learn visual prompts that direct attention in vision transformers without requiring manual markers or model fine-tuning.
Findings
Effective across various pre-trained vision encoders
Does not require annotations or fine-tuning
Guides attention to target regions successfully
Abstract
Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles have been shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, added to any input image, redirects the attention of the pre-trained vision transformer to the prompt's spatial location on the image. Specifically, the prompt is learned in a self-supervised manner without requiring annotations and without fine-tuning the vision transformer. Our…
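The idea in the abstract (place a learnable prompt at a known location, keep the encoder frozen, and train the prompt so that attention lands on that location) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the "encoder" is a single frozen query vector attending over random patch features, the prompt is a vector added to one patch, and the loss is the negative log of the attention at the prompt's location. The names `attention` and `learn_prompt` are hypothetical. The supervision signal is the placement location itself, which is why no annotations are needed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, patches, prompt=None, loc=None):
    """Attention of a frozen toy query over patch features.

    If a prompt is given, it is added to the features of patch `loc`.
    This stands in for a real ViT attention map (assumption).
    """
    d = len(query)
    scores = []
    for i, patch in enumerate(patches):
        if prompt is not None and i == loc:
            feat = [x + p for x, p in zip(patch, prompt)]
        else:
            feat = patch
        scores.append(sum(q * f for q, f in zip(query, feat)) / math.sqrt(d))
    return softmax(scores)


def learn_prompt(query, num_patches=16, steps=200, lr=0.5, seed=0):
    """Learn a prompt vector by gradient descent; the query stays frozen."""
    rng = random.Random(seed)
    dim = len(query)
    prompt = [0.0] * dim
    for _ in range(steps):
        # Random "image" and random placement: the location is the only label.
        patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
        loc = rng.randrange(num_patches)
        attn = attention(query, patches, prompt, loc)
        # Loss = -log attn[loc]; its gradient w.r.t. the prompt is
        # (attn[loc] - 1) * query / sqrt(dim).
        coeff = (attn[loc] - 1.0) / math.sqrt(dim)
        prompt = [p - lr * coeff * q for p, q in zip(prompt, query)]
    return prompt
```

In this toy setting only the prompt receives gradient updates, which mirrors the "without fine-tuning" constraint. A real setup would replace the single query vector with the attention maps of a pre-trained vision transformer.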
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper is easy to follow and understand.
- This method can be applied to different pre-trained ViTs, regardless of their pre-training supervision methods, and does not require annotation or fine-tuning of the ViT.
- The paper lacks essential details and descriptions, which may lead readers unfamiliar with the preceding work to struggle with understanding what is being tested or implemented. For example, the methodology for testing on the CUB dataset is not clearly outlined, and the significance of "K2N" on line 478, which stands for keypoint to name, is not explained.
- Building on the concept that a "red circle" can direct CLIP's attention to a specific area, as demonstrated in prior work by Shtedritski
1. The approach is model-agnostic and does not require fine-tuning, making it versatile across different models.
2. The method demonstrates strong generalization capabilities across various models and datasets.
3. The presentation is clear and easy to understand.
1. The paper lacks sufficient justification for the advantages of the approach compared to directly collecting and fine-tuning with annotated visual prompt data, particularly in terms of performance and efficiency benefits.
2. There are some typos: some citation formats need correction; Table 2 shows inconsistent decimal places across the results; the ordering in Figures 4 and 5 appears to be incorrect.
The approach appears to be adaptable, allowing self-supervised learning of visual prompts for various vision encoders without needing labeled data. This flexibility is particularly valuable for adapting vision transformers across tasks and models without costly retraining or fine-tuning.
There are a few concerns about the algorithmic design:
1. Although the method is intended as a self-supervised approach, its performance is primarily evaluated using annotated datasets like CUB (for keypoint detection) and RefCOCO (for object localization), where specific body parts or objects are pre-labeled. This reliance contrasts with the self-supervised nature of the learning process, potentially limiting practical utility. For instance, if access to labeled datasets is restricted or unavaila
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer
