CVPT: Cross Visual Prompt Tuning
Lingyun Huang, Jianxu Mao, Junfei Yi, Ziming Tao, Yaonan Wang

TL;DR
CVPT introduces a cross-attention mechanism in prompt tuning for vision models, effectively addressing VPT's limitations by preserving self-attention and improving performance across diverse datasets.
Contribution
This work proposes Cross Visual Prompt Tuning (CVPT), a novel prompt-tuning method whose cross-attention module models the interaction between prompts and image tokens while leaving the model's self-attention intact.
Findings
CVPT outperforms VPT on 25 datasets, including a gain of more than 4% in accuracy on VTAB-1K.
CVPT rivals leading adapter-based methods in performance and efficiency.
The code for CVPT is publicly available on GitHub.
Abstract
Parameter-Efficient Fine-Tuning (PEFT) has emerged to mitigate the computational demands of large-scale models. Within computer vision, adapter-based PEFT methods are often favored over prompt-based approaches like Visual Prompt Tuning (VPT) due to the latter's performance and efficiency limitations. Our analysis reveals that VPT's shortcomings stem from its prompt deployment strategy, which can distort the model's inherent self-attention mechanism. To address this, we propose Cross Visual Prompt Tuning (CVPT). CVPT introduces a cross-attention module to directly model interactions between prompts and image tokens. This design decouples the prompts from the input sequence, preserving the original self-attention integrity while enabling efficient feature integration. Furthermore, we employ a weight-sharing mechanism for cross-attention initialization, which enhances representative…
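The contrast the abstract draws between VPT and CVPT can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation. It assumes that image tokens act as queries and prompts as keys and values in the cross-attention, that the cross-attention simply reuses the frozen self-attention projections (one reading of the weight-sharing initialization), and that the two attention outputs are added residually. All names (`attention`, `rand_matrix`, `w_q`, `w_k`, `w_v`) and all sizes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with explicit projections."""
    q, k, v = matmul(queries, w_q), matmul(keys, w_k), matmul(values, w_v)
    scale = math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

rng = random.Random(0)
dim, num_tokens, num_prompts = 4, 6, 3        # toy sizes, chosen arbitrarily
tokens = rand_matrix(num_tokens, dim, rng)    # stand-in for image patch tokens
prompts = rand_matrix(num_prompts, dim, rng)  # stand-in for learnable prompts

# Stand-ins for the frozen self-attention projections of a pretrained block.
w_q, w_k, w_v = (rand_matrix(dim, dim, rng) for _ in range(3))

# VPT-style: prompts are concatenated into the input sequence, so the
# self-attention map now covers image tokens and prompts together.
seq = tokens + prompts
vpt_out, vpt_weights = attention(seq, seq, seq, w_q, w_k, w_v)

# CVPT-style (assumed layout): self-attention sees image tokens only ...
self_out, self_weights = attention(tokens, tokens, tokens, w_q, w_k, w_v)
# ... and a separate cross-attention lets image tokens query the prompts.
# Reusing w_q, w_k, w_v here is this sketch's stand-in for weight sharing.
cross_out, cross_weights = attention(tokens, prompts, prompts, w_q, w_k, w_v)
cvpt_out = add(add(tokens, self_out), cross_out)  # assumed residual combination

print(len(vpt_weights), len(vpt_weights[0]))      # VPT self-attention map size
print(len(self_weights), len(self_weights[0]))    # CVPT self-attention map size
print(len(cross_weights), len(cross_weights[0]))  # CVPT cross-attention map size
```

In this toy setup the VPT-style self-attention map grows to 9 x 9 because the 3 prompts join the 6 image tokens, whereas the CVPT-style self-attention map stays 6 x 6 and the prompts only enter through a separate 6 x 3 cross-attention map.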
Taxonomy
Topics: EEG and Brain-Computer Interfaces
