IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning
Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha

TL;DR
IntCoOp introduces an interpretable prompt-tuning method that incorporates compositional attributes to improve image-text alignment and few-shot learning performance in vision-language models.
Contribution
The paper proposes IntCoOp, a novel prompt-tuning approach that learns attribute-level inductive biases for better interpretability and improved performance over existing methods.
Findings
Outperforms state-of-the-art prompt tuning frameworks.
Improves average performance over CoOp by 7.35% in the 16-shot setting.
Enhances generalization to novel classes and domain shifts.
Abstract
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performance, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used, in which a set of contextual vectors is learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a "green" tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation, we propose a novel and interpretable prompt-tuning method…
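The image-text alignment score mentioned in the abstract can be illustrated with a minimal sketch. It assumes alignment is measured as cosine similarity between an image embedding and a prompt embedding, as in CLIP-style models. The three-dimensional vectors are hypothetical toy values, hand-picked so that the attribute prompt aligns better. The function names and the temperature of 0.01 are also assumptions. The sketch shows how the score is computed, not the authors' implementation or results.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_probabilities(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (CLIP-style scoring).

    The temperature value is an assumption chosen for illustration.
    """
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy embeddings; a real model would produce these with its
# image and text encoders.
image_emb = [0.9, 0.4, 0.1]  # stands in for an image of a green tree frog
prompts = {
    "a photo of a tree frog": [0.7, 0.1, 0.6],
    "a photo of a green tree frog": [0.8, 0.5, 0.2],
}

# Alignment score of each prompt with the image.
scores = {text: cosine_similarity(image_emb, emb) for text, emb in prompts.items()}

# Probability assigned to each prompt when the two compete for the same image.
probs = prompt_probabilities(image_emb, list(prompts.values()))

for text, score in scores.items():
    print(f"{text!r}: alignment = {score:.3f}")
```

With these toy vectors the prompt that names the attribute scores higher by construction. In a real model the gap would depend on the encoders and the data.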
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · ALIGN · Contrastive Language-Image Pre-training · Context Optimization
