Dynamic Multimodal Prototype Learning in Vision-Language Models

Xingyu Zhu; Shuo Wang; Beier Zhu; Miaoge Li; Yunfan Li; Junfeng Fang; Zhicai Wang; Dongsheng Wang; Hanwang Zhang

arXiv:2507.03657 · cs.CV · December 2, 2025

TL;DR

ProtoMM is a training-free, multimodal prototype learning framework that dynamically adapts vision-language models during testing by combining textual and visual features, improving zero-shot classification accuracy.

Contribution

The paper introduces multimodal prototypes that are updated dynamically at test time, addressing the ambiguous semantics of class names and improving generalization.

Findings

01. Achieves a 1.03% average accuracy improvement on ImageNet benchmarks.

02. Effectively combines textual and visual features for prototype learning.

03. Demonstrates robustness across 15 zero-shot benchmarks.

Abstract

With growing interest in pre-trained vision-language models (VLMs) such as CLIP, substantial effort has been devoted to downstream tasks, especially test-time adaptation (TTA). However, previous works learn prototypes only in the textual modality and overlook the ambiguous semantics of class names. These ambiguities yield textual prototypes that fail to capture visual concepts, which limits performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs at test time. By viewing each prototype as a discrete distribution over textual descriptions and visual particles, ProtoMM can combine multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing…
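
To make the idea of a prototype as a discrete distribution more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract as shown here is truncated and does not specify the scoring or update rule. The class name `MultimodalPrototype`, the uniform weights, the weighted cosine score, the first-in-first-out particle queue, and the `max_particles` parameter are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


@dataclass
class MultimodalPrototype:
    """One class prototype: a discrete distribution over text and visual atoms."""
    text_feats: list                       # embeddings of textual descriptions
    visual_particles: list = field(default_factory=list)
    max_particles: int = 4                 # hypothetical cap on stored particles

    def atoms(self):
        # Assumption: every atom gets the same probability mass.
        points = self.text_feats + self.visual_particles
        weight = 1.0 / len(points)
        return [(p, weight) for p in points]

    def score(self, image_feat):
        # Assumption: expected cosine similarity under the distribution,
        # used here as a simple stand-in for the paper's scoring rule.
        return sum(w * cosine(p, image_feat) for p, w in self.atoms())

    def update(self, image_feat):
        # Assumption: keep the most recent test features as visual particles.
        self.visual_particles.append(image_feat)
        if len(self.visual_particles) > self.max_particles:
            self.visual_particles.pop(0)


def classify_and_update(prototypes, image_feat):
    """Predict the best-scoring class, then add the image to its particles."""
    pred = max(prototypes, key=lambda name: prototypes[name].score(image_feat))
    prototypes[pred].update(image_feat)
    return pred


# Toy 2-D features standing in for real text and image embeddings.
protos = {
    "cat": MultimodalPrototype(text_feats=[[1.0, 0.0], [0.9, 0.1]]),
    "dog": MultimodalPrototype(text_feats=[[0.0, 1.0], [0.1, 0.9]]),
}
pred = classify_and_update(protos, [0.8, 0.2])
print(pred, len(protos[pred].visual_particles))
```

No parameters are trained in this sketch: the only state that changes across test samples is the set of visual particles, which is the sense in which such a scheme is training-free yet adaptive.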

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques