Integrated Structural Prompt Learning for Vision-Language Models

Jiahui Wang; Qin Xu; Bo Jiang; Bin Luo

arXiv:2507.05677 · cs.CV · July 10, 2025

TL;DR

This paper introduces an Integrated Structural Prompt (ISP) method for vision-language models that models structural relationships within and across modalities to improve transferability and generalization in various tasks.

Contribution

The paper proposes a novel ISP framework with structural prompt modules and a sample probing mechanism to enhance information interaction and prevent overfitting in VLMs.

Findings

01. ISP achieves state-of-the-art performance in base-to-new generalization.

02. ISP improves cross-dataset evaluation results.

03. ISP enhances domain generalization capabilities.

Abstract

Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) such as CLIP to various downstream tasks. These methods adopt handcrafted templates or learnable vectors to provide text or image instructions when fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing performance on base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while…
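The idea of relating learnable prompts to frozen tokens can be pictured with a small toy sketch. The code below is not the ISP module from the paper; it is a generic, hypothetical illustration in which each learnable prompt vector computes a cosine-similarity affinity to a set of frozen token embeddings, normalizes those affinities with a softmax, and mixes the resulting weighted token summary back into the prompt. The function name `structural_update`, the mixing weight `alpha`, and the random toy embeddings are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def structural_update(prompts, tokens, alpha=0.5):
    """Hypothetical single message-passing step (not the paper's module).

    Each prompt attends to the frozen tokens via cosine affinity and is
    blended with the affinity-weighted token average. Tokens are only read,
    never modified, mirroring the "frozen" setting.
    """
    updated, affinities = [], []
    for p in prompts:
        weights = softmax([cosine(p, t) for t in tokens])
        affinities.append(weights)
        message = [
            sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(p))
        ]
        updated.append(
            [(1 - alpha) * p[d] + alpha * message[d] for d in range(len(p))]
        )
    return updated, affinities


# Toy data: 2 learnable prompts and 3 frozen tokens, each of dimension 4.
random.seed(0)
dim = 4
prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

new_prompts, affinity = structural_update(prompts, tokens)
```

In this sketch the affinity matrix plays the role of a "structure" between prompts and tokens; applying the same step with tokens from the other modality would be one simple way to picture a cross-modal variant, though how the paper actually defines its self-structural and cross-structural modules is not recoverable from the truncated abstract.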

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: ADaptive gradient method with the OPTimal convergence rate · Balanced Selection · Contrastive Language-Image Pre-training