Rethinking Visual Prompt Learning as Masked Visual Token Modeling
Ning Liao, Bowen Shi, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian

TL;DR
This paper introduces VPTM, a novel visual prompt learning method that reformulates visual classification as masked token prediction on generative pre-trained models, enhancing performance and robustness.
Contribution
It is the first to adapt prompt learning to generative pre-trained visual models, unifying pre-training and downstream tasks through masked token modeling.
Findings
VPTM outperforms existing visual prompt methods.
VPTM demonstrates robustness to prompt variations.
VPTM achieves high efficiency in visual classification.
Abstract
Prompt learning has achieved great success in efficiently exploiting large-scale pre-trained models in natural language processing (NLP). It reformulates downstream tasks as the generative pre-training task to achieve consistency, thereby improving performance stably. However, when transferred to vision, current visual prompt learning methods are almost all designed on discriminative pre-trained models, and they lack a careful design that unifies the forms of the pre-training and downstream tasks. To explore prompt learning on generative pre-trained visual models while keeping task consistency, we propose Visual Prompt learning as masked visual Token Modeling (VPTM), which transforms downstream visual classification into the pre-trained masked visual token prediction task. In addition, we develop the prototypical verbalizer for mapping the predicted visual token…
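The abstract is cut off before it describes how the prototypical verbalizer works, so the following is only a generic illustration of the idea of mapping a prediction at a masked position to a class label by comparing it with per-class prototypes. It is not the authors' implementation: the function names, the cosine-similarity scoring, the temperature value, and the toy vectors are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def verbalize(masked_prediction, prototypes, temperature=0.1):
    # Hypothetical verbalizer: score the masked-position prediction against
    # one prototype vector per class and normalize the scores to probabilities.
    # Both the scoring rule and the temperature are assumptions of this sketch.
    labels = list(prototypes)
    scores = [cosine(masked_prediction, prototypes[c]) / temperature for c in labels]
    return dict(zip(labels, softmax(scores)))


# Toy data (invented for illustration): two class prototypes and one prediction.
prototypes = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
masked_prediction = [0.9, 0.1, 0.0]

probs = verbalize(masked_prediction, prototypes)
best = max(probs, key=probs.get)
print(best, probs)
```

In this toy setup the prediction lies closest to the "cat" prototype, so that class receives the highest probability. In a real system the prediction would come from the frozen generative model's output at the masked position, and how prototypes are built is a design choice the truncated abstract does not specify.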
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
