Understanding and Improving Visual Prompting: A Label-Mapping Perspective
Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, Sijia Liu

TL;DR
This paper investigates the role of label mapping in visual prompting for vision tasks, proposing a new iterative framework that enhances target task accuracy by optimizing label mappings, especially when combined with CLIP models.
Contribution
The paper introduces ILM-VP, an iterative label mapping framework that improves visual prompting accuracy, and extends label mapping to CLIP to improve text prompt selection.
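This summary does not spell out how ILM-VP builds its label mapping, so the following is only a minimal sketch of one plausible ingredient: a greedy, frequency-based one-to-one mapping from target classes to source classes. The function name `greedy_label_mapping` and its arguments are hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
from collections import Counter


def greedy_label_mapping(source_preds, target_labels, num_target_classes):
    """Map each target class to a distinct source class by co-occurrence count.

    source_preds[i]  : source-class index the frozen model predicts for
                       (prompted) target sample i
    target_labels[i] : ground-truth target-class index of sample i

    Hypothetical helper: a sketch under stated assumptions, not the
    authors' implementation.
    """
    counts = Counter(zip(target_labels, source_preds))
    mapping = {}
    used_sources = set()
    # Visit (target, source) pairs from most to least frequent; ties are
    # broken by index so the result is deterministic.
    for (t, s), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if t not in mapping and s not in used_sources:
            mapping[t] = s
            used_sources.add(s)
        if len(mapping) == num_target_classes:
            break
    return mapping


# Toy example: two target classes, predictions over arbitrary source classes.
mapping = greedy_label_mapping(
    source_preds=[5, 5, 7, 7, 7, 5],
    target_labels=[0, 0, 1, 1, 1, 1],
    num_target_classes=2,
)
```

In an iterative scheme, such a mapping would presumably be recomputed from fresh predictions each time the prompt is updated, so that the mapping and the prompt stay consistent with each other; that alternation is an assumption here, not a detail confirmed by this summary.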
Findings
ILM-VP outperforms existing VP methods by a substantial margin.
Reprogramming ResNet-18 with ILM-VP yields accuracy gains of up to 7.9% over prior VP baselines.
With CLIP, the proposed label-mapping-based VP improves accuracy by up to 13.7%.
Abstract
We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in terms of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even given a ruleless label mapping (LM) between the source classes and the target classes. Inspired by the above, we ask: How is LM interrelated with VP? And how to exploit such a relationship to improve its accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer that a better 'quality' of LM (assessed by mapping precision and explanation) can consistently improve the effectiveness of VP. This is in contrast to the prior art where the factor of LM was missing. To optimize LM, we propose a new…
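To make the two ingredients named in the abstract concrete, here is a small sketch, assuming the prompt is a single additive perturbation pattern shared by all inputs and the label mapping is a dictionary from target class to source class. The names `apply_prompt` and `predict_target` are hypothetical, images are plain nested lists of pixel values in [0, 1], and the source model is stood in for by a hand-written logit vector.

```python
def apply_prompt(image, prompt):
    """Add one universal perturbation pattern to an image, clamped to [0, 1].

    Assumption: the prompt is additive and has the same shape as the image.
    """
    return [
        [min(1.0, max(0.0, px + d)) for px, d in zip(img_row, prm_row)]
        for img_row, prm_row in zip(image, prompt)
    ]


def predict_target(source_logits, mapping):
    """Return the target class whose mapped source class scores highest.

    mapping: dict {target_class: source_class}; source classes that are not
    mapped to any target class are ignored.
    """
    return max(mapping, key=lambda t: source_logits[mapping[t]])


# The same prompt would be added to every downstream image.
image = [[0.25, 0.75], [0.5, 0.0]]
prompt = [[0.25, 0.5], [-0.75, 0.0]]
prompted = apply_prompt(image, prompt)

# Stand-in for the frozen source model's output on the prompted image.
source_logits = [0.1, 2.0, 0.3, 1.5]
predicted = predict_target(source_logits, {0: 3, 1: 2})
```

Note that the highest-scoring source class (index 1) is never chosen because no target class maps to it, which is why the choice of mapping can matter as much as the prompt itself.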
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
