PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities

Jiajun Chen; Sai Cheng; Yutao Yuan; Yirui Zhang; Haitao Yuan; Peng Peng; Yi Zhong

arXiv:2511.10997 · cs.CV · November 17, 2025

TL;DR

PROMISE introduces a hierarchical contrastive learning framework with prompt attention that strengthens cross-modal representations, stays robust when some modalities are missing, and outperforms existing methods.

Contribution

The paper presents a novel prompt-attentive hierarchical contrastive learning approach designed for robust multimodal representation when modalities are missing.

Findings

01. Outperforms state-of-the-art multimodal methods on benchmark datasets.

02. Effectively maintains cross-modal consistency with missing modalities.

03. Demonstrates robustness through extensive ablation studies.

Abstract

Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection