Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu, Ray Mooney

TL;DR
This paper investigates how vulnerable current vision-language models are to textual misinformation, showing that they often prioritize misleading text over visual evidence, which undermines their robustness and reliability.
Contribution
The study introduces the CONTEXT-VQA dataset and a benchmarking framework to evaluate VLMs' susceptibility to conflicting textual information, highlighting a critical robustness issue.
Findings
Models often override visual evidence with misleading text.
Performance drops by over 48.2% after persuasive prompts.
Vulnerabilities are consistent across 11 state-of-the-art models.
Abstract
Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual…
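The kind of measurement the abstract describes can be illustrated with a minimal sketch. It is not the authors' evaluation framework: the `Record` fields, the exact-match scoring rule, and the two metrics (accuracy drop in absolute terms, and the share of initially correct answers that flip) are all assumptions made for illustration, and the summary above does not say how the paper defines its performance drop.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated item. All field names here are hypothetical."""
    gold: str           # answer supported by the image
    answer_clean: str   # model answer to the image-question pair alone
    answer_misled: str  # model answer after the conflicting persuasive text


def is_correct(answer: str, gold: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match.
    return answer.strip().lower() == gold.strip().lower()


def susceptibility(records: list[Record]) -> dict[str, float]:
    """Compare accuracy with and without the misleading text."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    clean_hits = [is_correct(r.answer_clean, r.gold) for r in records]
    misled_hits = [is_correct(r.answer_misled, r.gold) for r in records]
    acc_clean = sum(clean_hits) / n
    acc_misled = sum(misled_hits) / n
    # Items answered correctly at first but wrongly after the persuasive text.
    flipped = sum(c and not m for c, m in zip(clean_hits, misled_hits))
    flip_rate = flipped / sum(clean_hits) if any(clean_hits) else 0.0
    return {
        "acc_clean": acc_clean,
        "acc_misled": acc_misled,
        "abs_drop": acc_clean - acc_misled,
        "flip_rate": flip_rate,
    }


if __name__ == "__main__":
    # Toy records invented for this example; they are not results from the paper.
    toy = [
        Record(gold="red", answer_clean="red", answer_misled="blue"),
        Record(gold="cat", answer_clean="cat", answer_misled="cat"),
    ]
    print(susceptibility(toy))
```

Reporting the flip rate alongside the raw accuracy drop separates items the model got wrong from the start from items it abandoned only after reading the conflicting text.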
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Misinformation and Its Impacts
