Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions
Gaurav Verma, Vishwa Vinay, Ryan A. Rossi, Srijan Kumar

TL;DR
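This paper investigates the robustness of fusion-based multimodal classifiers to cross-modal dilutions: additional text that stays relevant to the image and existing text but, once added to the input, leads to misclassification. On Crisis Humanitarianism and Sentiment Detection tasks, classifier performance drops by 23.3% and 22.5%, respectively.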
Contribution
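The authors develop a model that, given a multimodal (image + text) input, generates dilution text that remains relevant and topically coherent with the input yet causes misclassification when added to the original text, and they use it to demonstrate the brittleness of fusion-based multimodal classifiers.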
Findings
Classifier performance drops by 23.3% on Crisis Humanitarianism and 22.5% on Sentiment Detection in the presence of dilutions.
Dilutions are highly relevant and topically coherent.
The method effectively exposes model vulnerabilities.
Abstract
As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating their robustness becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions - a plausible variation. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions…
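The evaluation idea in the abstract (add dilution text to the original text, then compare classifier performance before and after) can be sketched as follows. This is a minimal illustration, not the authors' method: the keyword-based late-fusion classifier, the image scores, the example inputs, and the hand-written dilutions are all hypothetical stand-ins. In the paper the dilutions come from a learned generation model and the classifiers are trained task-specific models.

```python
from typing import Callable, List, Tuple

# (image_id, text, label) -- a hypothetical stand-in for an image + text input
Example = Tuple[str, str, int]

# Hypothetical per-image "crisis" scores, standing in for an image encoder.
IMAGE_SCORES = {"img_flood": 0.6, "img_park": 0.2}
# Hypothetical keywords, standing in for a text encoder.
KEYWORDS = {"flood", "rescue", "urgent"}


def toy_classifier(image_id: str, text: str) -> int:
    """Toy late-fusion classifier: average an image score and a text score."""
    tokens = text.lower().split()
    text_score = sum(t in KEYWORDS for t in tokens) / max(len(tokens), 1)
    fused = 0.5 * IMAGE_SCORES[image_id] + 0.5 * text_score
    return 1 if fused >= 0.5 else 0


def accuracy(classify: Callable[[str, str], int], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for img, txt, y in examples if classify(img, txt) == y)
    return correct / len(examples)


def accuracy_drop(
    classify: Callable[[str, str], int],
    examples: List[Example],
    dilutions: List[str],
) -> float:
    """Accuracy on the original inputs minus accuracy on the diluted inputs."""
    diluted = [(img, f"{txt} {d}", y) for (img, txt, y), d in zip(examples, dilutions)]
    return accuracy(classify, examples) - accuracy(classify, diluted)


EXAMPLES: List[Example] = [
    ("img_flood", "flood rescue urgent", 1),
    ("img_park", "sunny picnic today", 0),
]
# Hand-written dilutions: on-topic text that weakens the toy text signal.
DILUTIONS = [
    "the river near the town is popular with visitors in summer",
    "families often visit the park on weekends",
]

if __name__ == "__main__":
    print("clean accuracy:", accuracy(toy_classifier, EXAMPLES))
    print("accuracy drop:", accuracy_drop(toy_classifier, EXAMPLES, DILUTIONS))
```

In this toy setup the first example flips from the crisis class to the non-crisis class once the dilution is appended, because the added on-topic words shrink the share of keyword tokens in the fused score. The image and the original text are left unchanged, which is what distinguishes a dilution from an edit to the existing content.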
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
