MaPLe: Multi-modal Prompt Learning

Muhammad Uzair Khattak; Hanoona Rasheed; Muhammad Maaz; Salman Khan; Fahad Shahbaz Khan

arXiv:2210.03117 · cs.CV · April 4, 2023 · 5 cites

TL;DR

MaPLe introduces a multi-modal prompt learning approach that jointly optimizes vision and language prompts to enhance alignment and generalization of CLIP-like models across diverse image recognition tasks.

Contribution

This work proposes a novel multi-modal prompt learning framework that improves vision-language alignment by coupling prompts and modeling stage-wise features, outperforming prior single-branch prompt methods.

Findings

01

Achieves an absolute gain of 3.45% on novel classes over the state-of-the-art method Co-CoOp.

02

Improves the overall harmonic mean by 2.72%, averaged over 11 image recognition datasets.

03

Demonstrates effectiveness on tasks involving class generalization, dataset shifts, and domain adaptation.
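
The harmonic mean reported in the findings can be illustrated with a short sketch. It assumes, as is common in base-to-novel evaluation, that the harmonic mean is taken over base-class and novel-class accuracy; the function name and the accuracy values below are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    The score is high only when both accuracies are high, so a method
    cannot hide weak novel-class accuracy behind strong base-class accuracy.
    """
    if base_acc < 0 or novel_acc < 0:
        raise ValueError("accuracies must be non-negative")
    if base_acc + novel_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies, chosen only to show the behaviour.
balanced = harmonic_mean(70.0, 70.0)
imbalanced = harmonic_mean(80.0, 60.0)
print(f"balanced: {balanced:.2f}, imbalanced: {imbalanced:.2f}")
```

Both hypothetical pairs have the same arithmetic mean, but the imbalanced pair scores lower, which is why the harmonic mean is a useful summary of the base/novel trade-off.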

Abstract

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to…
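
The idea of coupling prompts across the two branches can be sketched as follows. The abstract as shown here does not specify the mechanism, so this sketch assumes one simple option: vision prompts are derived from learnable language prompts through a linear projection. All names, dimensions, and the random initialization are hypothetical illustrations, not the authors' implementation.

```python
import random


def make_language_prompts(num_tokens, dim, seed=0):
    """Stand-in for learnable language prompt vectors (random init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def make_projection(in_dim, out_dim, seed=1):
    """Stand-in for a learnable linear map, stored as out_dim rows of in_dim weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def couple(language_prompts, weight):
    """Derive one vision prompt per language prompt: v = W @ p.

    Because vision prompts are a function of language prompts, a gradient
    update to either would affect both branches (assumed coupling scheme).
    """
    return [
        [sum(w * x for w, x in zip(row, prompt)) for row in weight]
        for prompt in language_prompts
    ]


# Hypothetical sizes: 2 prompt tokens, text width 4, vision width 6.
text_prompts = make_language_prompts(num_tokens=2, dim=4)
projection = make_projection(in_dim=4, out_dim=6)
vision_prompts = couple(text_prompts, projection)
print(len(vision_prompts), len(vision_prompts[0]))
```

In a real setup the two prompt sets would be fed to the text and image encoders of a CLIP-like model and trained with its contrastive objective; that part is omitted here.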

Code & Models

Models

tongyujun/Subspace_Prompting (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training