Robust Multimodal Learning via Cross-Modal Proxy Tokens
Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif

TL;DR
This paper introduces cross-modal proxy tokens (CMPTs), which make multimodal models more robust to missing modalities by approximating the class token of an absent modality through attention over the tokens of the available modality, without explicit modality generation.
Contribution
The paper proposes CMPTs, a simple and efficient method that improves the robustness of multimodal models to missing modalities without requiring auxiliary networks or explicit modality generation.
Findings
Outperforms state-of-the-art baselines on five multimodal datasets across various missing rates.
Maintains strong performance when all modalities are available.
Keeps computational overhead low by training low-rank adapters inside frozen unimodal encoders.
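To make "missing rate" concrete, here is a minimal sketch. It assumes the term means the fraction of evaluation samples for which one modality is withheld; that reading is an assumption, since this summary does not define it. The function name `apply_missing_rate` and the dict-based sample layout are hypothetical and chosen only for illustration.

```python
import random


def apply_missing_rate(samples, modality, rate, seed=0):
    """Return a copy of `samples` with `modality` set to None for a
    fraction `rate` of them (assumed meaning of "missing rate")."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_drop = round(rate * len(samples))
    drop_idx = set(rng.sample(range(len(samples)), n_drop))
    # Copy every sample so the caller's data is never mutated.
    return [
        {**s, modality: None} if i in drop_idx else dict(s)
        for i, s in enumerate(samples)
    ]


# Hypothetical two-modality dataset: withhold text for 30% of samples.
samples = [{"image": f"img_{i}", "text": f"caption {i}"} for i in range(10)]
evaluation_set = apply_missing_rate(samples, "text", rate=0.3)
```

Sweeping `rate` from 0.0 to 1.0 would give one evaluation set per missing rate, with `rate=0.0` corresponding to the setting where all modalities are available.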
Abstract
Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while…
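The mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a single-head cross-attention in which a learnable proxy query attends over the available modality's tokens, a mean-squared-error alignment loss, and a scalar weight `lam` for combining the two losses; none of these specifics are given in the abstract. The low-rank adapters are omitted, and all function and variable names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def proxy_token(proxy_query, tokens, w_k, w_v):
    """Approximate a missing modality's class token.

    proxy_query: (d,) learnable query (assumed form of a CMPT)
    tokens:      (n, d) tokens of the available modality
    w_k, w_v:    (d, d) key and value projections
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ proxy_query / np.sqrt(proxy_query.shape[0])
    weights = softmax(scores)  # attention only over the available tokens
    return weights @ values    # (d,)


def alignment_loss(proxy, target_cls):
    """Assumed MSE between the proxy and the real class token."""
    return float(np.mean((proxy - target_cls) ** 2))


def cross_entropy(logits, label):
    """Task-specific loss for a classification task (an assumption)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def joint_loss(task, align, lam=1.0):
    """Joint objective: task loss plus weighted alignment loss."""
    return task + lam * align


# Toy example: text is available, the image class token is the target.
rng = np.random.default_rng(0)
d, n, num_classes = 8, 5, 3
text_tokens = rng.normal(size=(n, d))
image_cls = rng.normal(size=d)  # only available during training
query = rng.normal(size=d)
w_k, w_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_head = rng.normal(size=(d, num_classes))

proxy = proxy_token(query, text_tokens, w_k, w_v)
loss = joint_loss(cross_entropy(proxy @ w_head, label=1),
                  alignment_loss(proxy, image_cls))
```

At inference time with the image missing, only `proxy_token` would be needed, which is why this style of approach avoids generating the missing modality itself.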
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech and dialogue systems · Speech Recognition and Synthesis
