Robust Multimodal Learning via Cross-Modal Proxy Tokens

Md Kaykobad Reza; Ameya Patil; Mashhour Solh; M. Salman Asif

arXiv:2501.17823 · cs.CV · October 28, 2025

TL;DR

This paper introduces cross-modal proxy tokens (CMPTs), which make multimodal models more robust to missing modalities. A CMPT approximates the class token of a missing modality by attending only to the tokens of the available modality, so no explicit modality generation is needed.

Contribution

The paper proposes CMPTs, a simple and efficient method that improves robustness to missing modalities without auxiliary networks or explicit modality generation, while keeping performance strong when all modalities are available.

Findings

01 Outperforms state-of-the-art baselines on five multimodal datasets across various missing rates.

02 Maintains strong performance when all modalities are available.

03 Handles various missing rates with minimal computational overhead.

Abstract

Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while…
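The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the mean-squared-error form of the alignment loss, the mean-pooled image summary, the loss weight `lam`, and every function and variable name are assumptions chosen for illustration. Only the overall recipe comes from the abstract: a proxy token attends to the available modality's tokens, low-rank adapters sit on frozen encoder weights, and an alignment loss is optimized jointly with a task loss.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def lora_linear(x, w_frozen, a, b, scale=1.0):
    """Frozen projection plus a trainable low-rank update (A @ B)."""
    return x @ w_frozen + scale * (x @ a) @ b


def proxy_token(query, tokens, w_k, w_v):
    """Let one proxy query attend only to the available modality's tokens."""
    keys, values = tokens @ w_k, tokens @ w_v
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ values, weights


def alignment_loss(proxy, target_cls):
    """Assumed form: mean squared error between proxy and real class token."""
    return float(np.mean((proxy - target_cls) ** 2))


def cross_entropy(logits, label):
    """Task loss for a single example: negative log-softmax of the label."""
    shifted = logits - logits.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[label])


rng = np.random.default_rng(0)
d, n, rank, num_classes = 8, 5, 2, 3

raw_image_tokens = rng.normal(size=(n, d))
text_cls = rng.normal(size=d)          # real text class token, training only

# Frozen encoder weight with a low-rank adapter; B starts at zero, so the
# adapted layer initially matches the frozen one.
w_frozen = rng.normal(size=(d, d))
lora_a = rng.normal(size=(d, rank))
lora_b = np.zeros((rank, d))
image_tokens = lora_linear(raw_image_tokens, w_frozen, lora_a, lora_b)

proxy_query = rng.normal(size=d)       # hypothetical learnable proxy query
w_k, w_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
text_proxy, attn = proxy_token(proxy_query, image_tokens, w_k, w_v)

# Classify from an image summary plus the proxy standing in for missing text.
image_cls = image_tokens.mean(axis=0)  # stand-in for the image class token
w_head = rng.normal(size=(2 * d, num_classes))
logits = np.concatenate([image_cls, text_proxy]) @ w_head

lam = 1.0                              # hypothetical loss weight
total = cross_entropy(logits, label=1) + lam * alignment_loss(text_proxy, text_cls)
```

In this sketch the real text class token is used only as a training target. At inference with text missing, `text_proxy` alone would feed the classifier, which is why no generator for the absent modality is required.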

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Speech and dialogue systems · Speech Recognition and Synthesis