Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models

Shijun Yang; Xiang Zhang; Wanqing Zhao; Hangzai Luo; Sheng Zhong; Jinye Peng; Jianping Fan

arXiv:2507.08410 · cs.CV · July 14, 2025

Open Access

TL;DR

This paper introduces MuGCP, a novel multi-modal prompt learning framework that enhances vision-language models by generating semantic and visual prompts through mutual guidance, improving generalization and multi-modal task performance.

Contribution

MuGCP leverages multi-modal large language models for adaptive prompt generation and introduces an attention mutual-guidance module for better multi-modal alignment.

Findings

1. Outperforms state-of-the-art methods on 14 datasets.
2. Enhances class embedding modeling for unseen classes.
3. Improves multi-modal task accuracy.

Abstract

Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level…
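For context, the "final output layer" alignment mentioned in the second challenge can be illustrated with CLIP-style scoring, where an image embedding is compared against one text embedding per class by cosine similarity. The sketch below is a minimal illustration under stated assumptions, not the MuGCP method: the function names, the toy three-dimensional embeddings, and the temperature value are all hypothetical, and real encoders would supply the vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def class_probabilities(image_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (hypothetical helper).

    image_embedding: output of a vision encoder (assumed given).
    text_embeddings: one text-encoder output per class prompt (assumed given).
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text in text_embeddings:
        unit_text = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(image, unit_text))
        logits.append(cosine / temperature)
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: made-up embeddings standing in for real encoder outputs.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # prompt for class 0
    [0.0, 1.0, 0.0],  # prompt for class 1
    [0.0, 0.0, 1.0],  # prompt for class 2
]
probs = class_probabilities(image_embedding, text_embeddings)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In this setup the two modalities interact only through the final similarity score, which is the restriction the abstract identifies as a limitation of prevailing methods.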

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques