MaPLe: Multi-modal Prompt Learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

TL;DR
MaPLe introduces a multi-modal prompt learning approach that jointly optimizes vision and language prompts to enhance alignment and generalization of CLIP-like models across diverse image recognition tasks.
Contribution
This work proposes a novel multi-modal prompt learning framework that improves vision-language alignment by coupling prompts and modeling stage-wise features, outperforming prior single-branch prompt methods.
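The idea of coupling prompts across branches at several stages can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the coupling is a learnable linear projection from language prompts to vision prompts, with one independent set of prompts per stage. The function names (`linear`, `init_matrix`, `build_coupled_prompts`) and all dimensions are hypothetical, and no training loop or CLIP model is included.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x; weight is (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random Gaussian initialisation, standing in for learnable parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def build_coupled_prompts(num_prompts, text_dim, vision_dim, depth, seed=0):
    """Return one (language_prompts, vision_prompts) pair per stage.

    Assumption: vision prompts are not free parameters; each one is obtained
    by projecting the matching language prompt, so both branches share a
    single source of learnable prompt parameters at every stage.
    """
    rng = random.Random(seed)
    stages = []
    for _ in range(depth):
        language_prompts = init_matrix(num_prompts, text_dim, rng)
        weight = init_matrix(vision_dim, text_dim, rng)  # hypothetical coupling weights
        bias = [0.0] * vision_dim
        vision_prompts = [linear(p, weight, bias) for p in language_prompts]
        stages.append((language_prompts, vision_prompts))
    return stages


# Hypothetical sizes chosen only to keep the example small.
stages = build_coupled_prompts(num_prompts=2, text_dim=8, vision_dim=12, depth=3)
```

In a real setup the language prompts and projection weights would be trained while the pre-trained encoders stay frozen; here they are only initialised, to show how the vision prompts depend on the language prompts.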
Findings
Achieves an absolute gain of 3.45% on novel classes over the prior state-of-the-art method, Co-CoOp.
Improves the overall harmonic mean by 2.72%, averaged over 11 image recognition datasets.
Demonstrates effectiveness on tasks involving class generalization, dataset shifts, and domain adaptation.
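For readers unfamiliar with the harmonic-mean metric, the sketch below shows how it is typically computed in base-to-novel generalization benchmarks, assuming it combines base-class and novel-class accuracy. The function name and the accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent).

    The harmonic mean is pulled toward the lower of the two values, so a
    method cannot score well by excelling on base classes alone.
    """
    if base_acc <= 0 or novel_acc <= 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies, for illustration only.
example_hm = harmonic_mean(80.0, 70.0)
```

Because the metric penalizes imbalance, a gain on novel classes raises the harmonic mean more than an equal gain on an already stronger base-class score.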
Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
