PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities
Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan, Peng Peng, Yi Zhong

TL;DR
PROMISE is a hierarchical contrastive learning framework with prompt attention that keeps cross-modal representations robust when some modalities are missing. The authors report that it outperforms existing methods.
Contribution
The paper presents a prompt-attentive hierarchical contrastive learning approach designed specifically for robust multimodal representation when modalities are missing.
Findings
Outperforms state-of-the-art multimodal methods on benchmark datasets.
Effectively maintains cross-modal consistency with missing modalities.
Demonstrates robustness through extensive ablation studies.
Abstract
Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE…
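The abstract is cut off before it describes how PROMISE works, so the following is not the authors' method. It is only a minimal sketch of a generic symmetric InfoNCE objective, the kind of cross-modal contrastive loss that frameworks of this family commonly build on, written in plain Python. All function names, the temperature value, and the toy embeddings are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i].

    Generic illustration only; the temperature of 0.1 is an assumed value.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, a_i in enumerate(a):
        # Similarity of anchor i to every candidate in the other modality.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


# Toy 2-D embeddings (hypothetical): row i of each list describes the same sample.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # pairs agree across modalities
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = symmetric_contrastive_loss(image_emb, text_aligned)
loss_swapped = symmetric_contrastive_loss(image_emb, text_swapped)
print(loss_aligned < loss_swapped)
```

The loss is lower when paired image and text embeddings point in the same direction, which is the sense in which such an objective encourages cross-modal consistency. How PROMISE adapts this idea hierarchically, or to inputs with a missing modality, is not stated in the visible portion of the abstract.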
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
