VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng, Kai Han

TL;DR
VaMP introduces a variational prompt learning framework for vision-language models that personalizes prompts based on input uncertainty, improving few-shot and domain generalization performance.
Contribution
The paper presents a novel variational approach to multi-modal prompt tuning, incorporating instance-specific prompts and uncertainty modeling for better adaptation.
Findings
Achieves state-of-the-art results on few-shot benchmarks.
Effectively models uncertainty and instance variation.
Enhances domain generalization in vision-language tasks.
Abstract
Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
