A Closer Look at Conditional Prompt Tuning for Vision-Language Models
Ji Zhang, Shihan Wu, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

TL;DR
This paper identifies limitations in current conditional prompt tuning methods for vision-language models and introduces CaPT, a class-adaptive prompt tuning approach that improves generalization to new classes and enhances existing methods.
Contribution
The paper shows that prompts conditioned on Visual Image Information (VII) are ineffective and proposes CaPT, which instead conditions prompts on Textual Class Information (TCI), improving both performance and generalization in vision-language models.
Findings
CaPT improves the performance of unconditional prompt tuning (PT) baselines across 11 datasets.
CaPT can be integrated into existing PT schemes to mitigate the Base-New Tradeoff (BNT).
DeCaPT outperforms state-of-the-art conditional PT by 3.49% on average.
Abstract
Despite the great promise of Prompt Tuning (PT) in adapting large Vision-Language Pretrained Models (VLPMs) to downstream tasks, PT methods often struggle to overcome the Base-New Tradeoff (BNT) dilemma: as VLPMs are better tuned to a base task, their ability to generalize to new tasks diminishes. Recent work on conditional PT addresses this problem by replacing static prompts with dynamic Visual Image Information (VII)-conditioned prompts, improving the model's generalization to new tasks to some extent. In this work, we first identify a critical issue with existing conditional PT methods: using VII as the "condition" of prompts yields suboptimal performance, and even random noise-conditioned prompts can outperform the VII-conditioned counterparts. On further analysis, we find that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem.…
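The distinction the abstract draws between static, VII-conditioned, noise-conditioned, and TCI-conditioned prompts can be pictured with a toy sketch. This is not the CaPT implementation, which the abstract does not describe. The additive-bias form, the `project` function, the `build_prompt` helper, the vector sizes, and the choice of a class embedding as the TCI condition are all assumptions made here purely for illustration.

```python
import random

DIM = 4      # toy embedding width (hypothetical)
N_CTX = 2    # number of learnable context vectors (hypothetical)

rng = random.Random(0)


def rand_vec(dim=DIM):
    """Return a random toy vector standing in for a learned or encoded feature."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Context vectors shared by every class: the "static" prompt of unconditional PT.
context = [rand_vec() for _ in range(N_CTX)]

# Hypothetical projection weights, standing in for a small learned network.
W = [rand_vec() for _ in range(DIM)]


def project(condition):
    """Map a condition vector to a bias vector (assumed linear for this sketch)."""
    return [sum(w * c for w, c in zip(row, condition)) for row in W]


def build_prompt(class_embedding, condition=None):
    """Build a prompt as context vectors followed by the class embedding.

    With condition=None the context is static. Otherwise every context
    vector is shifted by a bias computed from the condition (an assumed
    form of conditioning, not taken from the paper).
    """
    if condition is None:
        ctx = [list(v) for v in context]
    else:
        bias = project(condition)
        ctx = [[c + b for c, b in zip(v, bias)] for v in context]
    return ctx + [list(class_embedding)]


class_emb = {"cat": rand_vec(), "dog": rand_vec()}
image_feat = {"img_a": rand_vec(), "img_b": rand_vec()}

static = build_prompt(class_emb["cat"])                               # unconditional
vii_a = build_prompt(class_emb["cat"], condition=image_feat["img_a"])  # VII-conditioned
vii_b = build_prompt(class_emb["cat"], condition=image_feat["img_b"])
noise = build_prompt(class_emb["cat"], condition=rand_vec())           # noise-conditioned
tci_cat = build_prompt(class_emb["cat"], condition=class_emb["cat"])   # TCI-conditioned
tci_dog = build_prompt(class_emb["dog"], condition=class_emb["dog"])
```

In this sketch a VII-conditioned prompt changes with every image, while a TCI-conditioned prompt depends only on the class, so it is identical for all images of that class and differs between classes.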
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
