Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
Nihal V. Nayak, Peilin Yu, Stephen H. Bach

TL;DR
This paper proposes compositional soft prompting (CSP), a parameter-efficient method that improves the zero-shot compositionality of vision-language models such as CLIP. CSP learns attribute and object tokens during training and recomposes them at test time to recognize unseen attribute-object pairs, yielding higher AUC than CLIP and CoOp on benchmark datasets.
Contribution
CSP treats the attributes and objects that define classes as learnable vocabulary tokens in a large-scale pretrained vision-language model. This improves zero-shot compositionality and outperforms both CLIP and the soft prompting method CoOp.
Findings
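CSP outperforms CLIP by an average of 10.9 percentage points on AUC across the benchmark datasets.
CSP outperforms CoOp by an average of 5.8 percentage points on AUC.
CSP improves generalization to higher-order attribute-attribute-object compositions (e.g., old white cat).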
Abstract
We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) like CLIP. We develop CSP for compositional zero-shot learning, the task of predicting unseen attribute-object compositions (e.g., old cat and young tiger). VLMs have a flexible text encoder that can represent arbitrary classes as natural language prompts but they often underperform task-specific architectures on the compositional zero-shot benchmark datasets. CSP treats the attributes and objects that define classes as learnable tokens of vocabulary. During training, the vocabulary is tuned to recognize classes that compose tokens in multiple ways (e.g., old cat and white cat). At test time, we recompose the learned attribute-object vocabulary in new combinations to recognize novel classes. We show…
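The compose-and-recompose idea described in the abstract can be illustrated with a minimal sketch. It assumes a toy setup that is not the authors' implementation: the embedding values are made up, `encode_prompt` is a hypothetical stand-in for a frozen text encoder (it simply averages token embeddings), and `PREFIX` stands in for a fixed prompt prefix. Training is omitted; the sketch only shows how attribute and object tokens, once learned, could be recombined to score a pair that never appeared during training.

```python
import math

# Hypothetical toy vocabulary: one embedding per attribute and per object.
# In CSP these would be the learnable tokens; the values here are made up.
attr_emb = {"old": [1.0, 0.0, 0.0, 0.0], "young": [0.0, 1.0, 0.0, 0.0]}
obj_emb = {"cat": [0.0, 0.0, 1.0, 0.0], "tiger": [0.0, 0.0, 0.0, 1.0]}
PREFIX = [0.1, 0.1, 0.1, 0.1]  # assumption: stand-in for a fixed prompt prefix


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def encode_prompt(attr, obj):
    """Toy stand-in for a frozen text encoder: average the token embeddings."""
    tokens = [PREFIX, attr_emb[attr], obj_emb[obj]]
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return normalize(mean)


def predict(image_feat, candidate_pairs):
    """Return the best-matching pair and a softmax over cosine similarities."""
    img = normalize(image_feat)
    scores = [
        sum(a * b for a, b in zip(img, encode_prompt(attr, obj)))
        for attr, obj in candidate_pairs
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(candidate_pairs)), key=probs.__getitem__)
    return candidate_pairs[best], probs


# Pairs that would be seen during training versus novel recompositions.
seen_pairs = [("old", "cat"), ("young", "tiger")]
unseen_pairs = [("old", "tiger"), ("young", "cat")]

# Made-up image feature that resembles "old" plus "tiger".
image_feat = [0.9, 0.0, 0.1, 0.8]
label, probs = predict(image_feat, seen_pairs + unseen_pairs)
print(label)
```

In this toy example the highest-scoring class is an unseen pair built entirely from tokens that each appeared in some seen pair, which is the property the abstract relies on at test time.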
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Context Optimization
