Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

Xiaoyuan Liu; Wenxuan Wang; Youliang Yuan; Jen-tse Huang; Qiuzhi Liu; Pinjia He; Zhaopeng Tu

arXiv:2410.08145 · cs.CL · June 3, 2025

TL;DR

This paper investigates conflicts between visual data and internal knowledge in multimodal large language models, introducing a benchmark and framework to evaluate and improve their conflict-resolution abilities.

Contribution

It presents a novel automated framework and diagnostic benchmark for assessing vision-knowledge conflicts in MLLMs, along with analysis of existing mitigation strategies.

Findings

1. Approximately 20% of queries show over-reliance on parametric knowledge.

2. Existing mitigation strategies only partially reduce conflicts.

3. The proposed framework can scale to generate more conflict scenarios.

Abstract

This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts the model's internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. Using this framework, we have crafted a diagnostic benchmark consisting of 374 original images and 1,122 high-quality question-answer (QA) pairs. The benchmark covers two aspects of conflict and three question types, providing a thorough assessment tool. We apply this benchmark to assess the conflict-resolution capabilities of nine representative MLLMs from various model families. Our results indicate an evident over-reliance on parametric knowledge for approximately 20% of all queries, especially among…
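To make the "over-reliance on parametric knowledge" figure concrete, the following is a minimal sketch of how such a rate could be computed from per-query results. It is not the authors' evaluation code: the `Record` fields, the `normalize` helper, the exact-match scoring, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One benchmark query. All field names here are hypothetical."""
    visual_answer: str     # answer supported by what the image shows
    knowledge_answer: str  # answer that commonsense alone would suggest
    model_answer: str      # answer the model actually produced


def normalize(text: str) -> str:
    """Lowercase and trim so that 'No ' and 'no' compare equal."""
    return text.strip().lower()


def over_reliance_rate(records: list[Record]) -> float:
    """Fraction of queries where the model gives the commonsense answer
    even though the image supports a different one."""
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if normalize(r.model_answer) == normalize(r.knowledge_answer)
        and normalize(r.model_answer) != normalize(r.visual_answer)
    )
    return hits / len(records)


# Illustrative data only (not taken from the benchmark): in each case the
# image contradicts commonsense, and one of five answers ignores the image.
records = [
    Record(visual_answer="yes", knowledge_answer="no", model_answer="No"),
    Record(visual_answer="yes", knowledge_answer="no", model_answer="yes"),
    Record(visual_answer="yes", knowledge_answer="no", model_answer="yes"),
    Record(visual_answer="yes", knowledge_answer="no", model_answer="Yes"),
    Record(visual_answer="yes", knowledge_answer="no", model_answer="yes"),
]

rate = over_reliance_rate(records)
print(f"over-reliance rate: {rate:.0%}")
```

Exact string matching is only a stand-in; free-form answers would likely need a more forgiving comparison. As a side note, 374 images and 1,122 QA pairs are consistent with three questions per image (374 × 3 = 1,122), although that pairing is an inference and is not stated in the text above.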

Code & Models

Repositories

xyliu-cs/ConflictVIS
PyTorch · Official

Videos

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs · underline

Taxonomy

Topics: Semantic Web and Ontologies · Translation Studies and Practices