Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang

TL;DR
This paper introduces a task-oriented multi-modal mutual learning approach that enhances vision-language models by using class-aware prompts and text-guided feature tuning, significantly improving generalization to new classes.
Contribution
It proposes a novel class-aware text prompt and text-guided feature tuning to better leverage image semantics and improve downstream task performance.
Findings
Outperforms existing methods on eleven classification benchmarks.
Achieves an average improvement of 4.03% on new classes.
Enhances base-to-new generalization performance significantly.
Abstract
Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an appropriate prompt for each specific task. Recent CoCoOp further boosts the base-to-new generalization performance via an image-conditional prompt. However, it directly fuses identical image semantics to prompts of different labels and significantly weakens the discrimination among different classes as shown in our experiments. Motivated by this observation, we first propose a class-aware text prompt (CTP) to enrich generated prompts with label-related image information. Unlike CoCoOp, CTP can effectively involve image semantics and avoid introducing extra ambiguities into different prompts. On the other hand, instead of reserving the complete image…
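The contrast the abstract draws between an image-conditional prompt and a class-aware prompt can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how CTP computes label-related image information, so the attention-style pooling below, along with every function name and the toy vectors, is an assumption made purely for illustration. The sketch only shows the structural difference: a shared image vector shifts every class prompt identically, while a label-conditioned vector shifts each class prompt differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def image_conditional_prompts(class_prompts, image_bias):
    """CoCoOp-style conditioning: one image-derived vector is added
    to the prompt of every class, so all classes get the same shift."""
    return [add(p, image_bias) for p in class_prompts]


def class_aware_prompts(class_prompts, label_embeds, patch_feats):
    """Hypothetical class-aware conditioning (an assumption, not CTP itself):
    each label embedding attends over the image patch features, so every
    class receives its own label-related image vector."""
    dim = len(label_embeds[0])
    scale = math.sqrt(dim)
    out = []
    for prompt, query in zip(class_prompts, label_embeds):
        weights = softmax([dot(query, k) / scale for k in patch_feats])
        # Weighted average of patch features, specific to this label.
        ctx = [sum(w * k[d] for w, k in zip(weights, patch_feats))
               for d in range(dim)]
        out.append(add(prompt, ctx))
    return out


def shifts(new_prompts, old_prompts):
    """Per-class difference between conditioned and original prompts."""
    return [[n - o for n, o in zip(new, old)]
            for new, old in zip(new_prompts, old_prompts)]


# Toy inputs: two classes, two image patches, two dimensions.
class_prompts = [[1.0, 0.0], [0.0, 1.0]]
label_embeds = [[2.0, 0.0], [0.0, 2.0]]
patch_feats = [[1.0, 0.0], [0.0, 1.0]]
image_bias = [0.5, 0.5]

shared = image_conditional_prompts(class_prompts, image_bias)
aware = class_aware_prompts(class_prompts, label_embeds, patch_feats)
```

Calling `shifts(shared, class_prompts)` returns the same vector for both classes, whereas `shifts(aware, class_prompts)` returns a different vector per class, which is the property the abstract attributes to class-aware prompting.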
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Context Optimization · ALIGN
