What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
Siting Li, Chenzhuang Du, Yue Zhao, Yu Huang, Hang Zhao

TL;DR
This paper analyzes, from an information-theoretic perspective, how multi-modal models behave when modalities are missing, and introduces UME-MMA, a flexible framework that strengthens feature extraction and noise robustness and improves performance across multiple datasets.
Contribution
The paper proposes UME-MMA, a novel plug-and-play framework that leverages uni-modal pre-trained weights and missing modality data augmentation for robust multi-modal learning.
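The missing-modality data augmentation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_missing_modality`, the parameter `p_missing`, the list-of-floats feature representation, and the choice to zero-fill the dropped modality are all assumptions made for illustration, since this summary does not specify how the augmentation is carried out.

```python
import random


def mask_missing_modality(audio, visual, p_missing=0.3, rng=None):
    """Randomly simulate one missing modality for a single training sample.

    audio, visual: lists of floats (hypothetical per-modality features).
    With probability p_missing, exactly one modality, chosen uniformly,
    is replaced by zeros. Both modalities are never dropped together.
    Zero-filling is an assumption, not something the source specifies.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < p_missing:
        if rng.random() < 0.5:
            audio = [0.0] * len(audio)    # simulate missing audio
        else:
            visual = [0.0] * len(visual)  # simulate missing video
    return audio, visual


# Usage: apply per sample during training so the model also sees
# inputs in which one modality is absent.
audio_feat = [0.2, 0.7]
visual_feat = [0.1, 0.4, 0.9]
aug_audio, aug_visual = mask_missing_modality(
    audio_feat, visual_feat, p_missing=0.5, rng=random.Random(0)
)
```

Keeping at least one modality intact in every sample reflects the setting described here, where the model must rely on whatever non-missing modality remains.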
Findings
UME-MMA improves performance on audio-visual datasets.
UME-MMA enhances robustness to missing modalities.
The approach is compatible with various encoders and modalities.
Abstract
With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) The encoder should be able to extract sufficiently good features from the non-missing modality; (2) The extracted features should be robust enough not to be influenced by noise during the…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
