Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Chi Zhang; Wenxuan Ding; Jiale Liu; Mingrui Wu; Qingyun Wu; Ray Mooney

arXiv:2601.19202 · cs.CL · January 28, 2026

TL;DR

This paper investigates how vulnerable current vision-language models are to textual misinformation, revealing that they often prioritize misleading text over visual evidence, which undermines their robustness and reliability.

Contribution

The study introduces the CONTEXT-VQA dataset and a benchmarking framework to evaluate VLMs' susceptibility to conflicting textual information, highlighting a critical robustness issue.

Findings

01. Models often override visual evidence with misleading text.

02. Performance drops by more than 48.2% after persuasive prompts.

03. Vulnerabilities are consistent across 11 state-of-the-art models.

Abstract

Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual…
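As a rough illustration of the evaluation idea described in the abstract, the sketch below compares a model's accuracy on image-question pairs with and without an added piece of conflicting text. It is not the authors' framework: the `Example` fields, the `Model` callable signature, the `susceptibility` helper, the toy model, and the choice to report the drop as an absolute accuracy difference are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Example:
    image_id: str         # hypothetical identifier standing in for the image
    question: str
    answer: str           # answer supported by the visual evidence
    misleading_text: str  # persuasive text that contradicts the image


# Assumed interface: (image_id, question, optional extra text) -> answer string.
Model = Callable[[str, str, Optional[str]], str]


def accuracy(model: Model, data: List[Example], with_misinformation: bool) -> float:
    """Fraction of examples answered correctly, optionally with conflicting text."""
    correct = 0
    for ex in data:
        extra = ex.misleading_text if with_misinformation else None
        pred = model(ex.image_id, ex.question, extra)
        correct += int(pred.strip().lower() == ex.answer.strip().lower())
    return correct / len(data)


def susceptibility(model: Model, data: List[Example]) -> Dict[str, float]:
    """Accuracy before and after misinformation, plus the absolute difference."""
    before = accuracy(model, data, with_misinformation=False)
    after = accuracy(model, data, with_misinformation=True)
    return {"before": before, "after": after, "absolute_drop": before - after}


# Toy stand-in for a VLM: it "sees" the truth, but for two images it is
# swayed by the misleading text and repeats that text's last word.
TRUTH = {"img1": "red", "img2": "two", "img3": "cat", "img4": "left"}


def toy_model(image_id: str, question: str, extra_text: Optional[str]) -> str:
    if extra_text is not None and image_id in {"img1", "img2"}:
        return extra_text.split()[-1]
    return TRUTH[image_id]


data = [
    Example("img1", "What color is the car?", "red", "The car is clearly blue"),
    Example("img2", "How many dogs are there?", "two", "There are exactly three"),
    Example("img3", "What animal is shown?", "cat", "This is obviously a dog"),
    Example("img4", "Which side is the door on?", "left", "It is on the right"),
]

result = susceptibility(toy_model, data)
print(result)
```

A real evaluation would replace `toy_model` with calls to an actual VLM and would need a more forgiving answer-matching rule than exact string comparison.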

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Misinformation and Its Impacts