DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision-Language Transformers to Missing Modalities

Jueqing Lu; Yuanyuan Qi; Xiaohao Yang; Shuaicheng Niu; Fucai Ke; Shujie Zhou; Wei Tan; Jionghao Lin; Wray Buntine; Hamid Rezatofighi; Lan Du

arXiv:2505.08283·cs.LG·November 18, 2025

Open Access

TL;DR

This paper introduces Decoupled Prototype Learning (DPL), a novel prediction head architecture for vision-language transformers that improves robustness to missing modalities by adaptively selecting class prototypes based on available input information.

Contribution

DPL is a new architecture that explicitly adjusts decision processes for missing modalities, outperforming existing methods on multiple multimodal datasets.

Findings

01 DPL significantly improves robustness to missing modalities.

02 DPL outperforms state-of-the-art methods on several datasets.

03 DPL maintains compatibility with existing prompt-based frameworks.

Abstract

The performance of Vision-Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions from incomplete information. Existing missing-aware prompt methods reduce this degradation, but they still rely on conventional prediction heads (e.g., a fully connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality case (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information…
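
To make the idea of a case-dependent prediction head concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract is truncated before it states how scores are computed, so the scoring rule (a sum of dot products between each feature and its matching prototype component), the use of a single prototype per class and case rather than a set, and all names (`class_scores`, `predict`, the `prototypes` layout) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
# Assumption: one prototype per (case, class), stored as
# (image-specific component, text-specific component).
Prototype = Tuple[Vector, Vector]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def class_scores(
    img_feat: Vector,
    txt_feat: Vector,
    case: str,
    prototypes: Dict[str, List[Prototype]],
) -> List[float]:
    """Score every class using the prototypes selected for `case`.

    Unlike a single fully connected head, the prototypes used here
    change with the missing-modality case of the input.
    """
    if case not in prototypes:
        raise ValueError(f"unknown missing-modality case: {case!r}")
    scores = []
    for img_proto, txt_proto in prototypes[case]:
        # Assumed scoring rule: each feature is compared only with the
        # prototype component of its own modality.
        scores.append(dot(img_feat, img_proto) + dot(txt_feat, txt_proto))
    return scores


def predict(
    img_feat: Vector,
    txt_feat: Vector,
    case: str,
    prototypes: Dict[str, List[Prototype]],
) -> int:
    """Return the index of the highest-scoring class."""
    scores = class_scores(img_feat, txt_feat, case, prototypes)
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, `prototypes` maps a case name such as `"image-missing"` to a list with one `(image_component, text_component)` pair per class, so the same input features can receive different class scores depending on which modality is absent. In a trained model the prototypes would be learned parameters; here they are just lists of floats.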

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning