Unified Vision and Language Prompt Learning
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

TL;DR
This paper introduces Unified Prompt Tuning (UPT), a method that jointly optimizes prompts across vision and language modalities, improving few-shot learning and domain generalization over unimodal prompt tuning methods.
Contribution
The paper proposes UPT, a novel approach that combines text and visual prompt tuning into a unified framework, addressing their individual limitations.
Findings
UPT outperforms unimodal prompt tuning on multiple datasets.
UPT achieves better trade-offs in few-shot learning scenarios.
UPT enhances domain generalization across diverse vision datasets.
Abstract
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a…
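The abstract describes UPT as learning a tiny neural network that jointly optimizes prompts across modalities. A minimal sketch of that idea follows, using only the Python standard library. The abstract gives no architectural details, so everything here is an assumption made for illustration: the residual tanh layer standing in for the tiny network, the split of a shared prompt into text and visual parts, and all names and sizes (`tiny_network`, `unified_prompts`, `dim`, `n_text`, `n_visual`). It is not the authors' implementation.

```python
import math
import random


def init_matrix(rows, cols, rng, scale=0.02):
    """Return a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def tiny_network(tokens, weight):
    """Hypothetical stand-in for the tiny network: a residual tanh layer per token."""
    out = []
    for tok in tokens:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, tok))) for row in weight]
        out.append([x + h for x, h in zip(tok, hidden)])
    return out


def unified_prompts(shared, weight, n_text):
    """Transform the shared prompt tokens, then split them into text and visual prompts."""
    transformed = tiny_network(shared, weight)
    return transformed[:n_text], transformed[n_text:]


rng = random.Random(0)
dim, n_text, n_visual = 8, 4, 4  # illustrative sizes, not taken from the paper
shared = init_matrix(n_text + n_visual, dim, rng)  # shared learnable prompt tokens
weight = init_matrix(dim, dim, rng)                # parameters of the tiny network
text_prompts, visual_prompts = unified_prompts(shared, weight, n_text)
print(len(text_prompts), len(visual_prompts), len(text_prompts[0]))
```

In this sketch both prompt sets pass through the same `weight`, so an update to the shared parameters changes the text and visual prompts together, which is the sense of "joint" illustrated here. In a real setup the two outputs would be fed to the text and image encoders of a frozen model such as CLIP.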
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Text and Document Classification Technologies
Methods: Contrastive Language-Image Pre-training
