Integrated Structural Prompt Learning for Vision-Language Models
Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo

TL;DR
This paper introduces an Integrated Structural Prompt (ISP) method for vision-language models that models structural relationships within and across modalities to improve transferability and generalization in various tasks.
Contribution
The paper proposes a novel ISP framework with structural prompt modules and a sample probing mechanism to enhance information interaction and prevent overfitting in VLMs.
Findings
ISP achieves state-of-the-art performance in base-to-new generalization.
ISP improves cross-dataset evaluation results.
ISP enhances domain generalization capabilities.
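Base-to-new generalization is commonly summarized with the harmonic mean of base-class and new-class accuracy, which rewards methods that balance the two rather than excelling on only one. Whether this paper reports exactly that metric is an assumption here, since the summary does not say. A minimal sketch, with a hypothetical helper name and made-up accuracy values:

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    if base_acc + new_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Illustrative numbers only, NOT results from the paper.
# Both pairs have the same arithmetic mean (80.0).
balanced = harmonic_mean(80.0, 80.0)
lopsided = harmonic_mean(95.0, 65.0)

print(f"balanced: {balanced:.2f}")
print(f"lopsided: {lopsided:.2f}")
```

Although both pairs average 80.0 arithmetically, the lopsided pair scores lower under the harmonic mean, which is why the metric suits the base/new trade-off.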
Abstract
Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) such as CLIP to various downstream tasks. These methods adopt handcrafted templates or learnable vectors to provide text or image instructions when fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing performance on base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: ADaptive gradient method with the OPTimal convergence rate · Balanced Selection · Contrastive Language-Image Pre-training
