Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

Yexin Liu; Zhengyang Liang; Yueze Wang; Xianfeng Wu; Feilong Tang; Muyang He; Jian Li; Zheng Liu; Harry Yang; Sernam Lim; Bo Zhao

arXiv:2406.10638 · cs.CV · March 20, 2025 · 1 citation

TL;DR

This paper reveals that Multimodal Large Language Models often answer incorrectly despite understanding visuals, due to dataset biases and low visual attention, and proposes methods to improve their focus and accuracy.

Contribution

The paper introduces a new benchmark for error analysis in MLLMs and proposes techniques to enhance visual attention and diversify training data.

Findings

01

MLLMs show lower attention to visual tokens compared to question tokens.

02

Bias in instruction tuning datasets affects MLLMs' responses to indirect questions.

03

Proposed methods improve MLLMs' focus on visual content and answer accuracy.

Abstract

Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that 'directly' relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We…
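The second issue in the abstract, lower attention to visual tokens than to system and question tokens, can be illustrated with a small calculation. The following is a minimal sketch, not the authors' analysis code: it assumes one attention row (the weights from a single query position to every input position) and hypothetical half-open spans marking where the system, image, and question tokens sit. The function name `attention_by_segment` and the toy numbers are illustrative only.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of key positions


def attention_by_segment(
    attn_row: List[float], segments: Dict[str, Span]
) -> Dict[str, Dict[str, float]]:
    """Summarise how one attention row is split across token segments.

    attn_row: attention weights from one query position to all key positions.
    segments: hypothetical layout mapping a segment name to its span.
    Returns, per segment, the share of total mass and the mean mass per token.
    """
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")

    stats: Dict[str, Dict[str, float]] = {}
    for name, (start, end) in segments.items():
        if not 0 <= start < end <= len(attn_row):
            raise ValueError(f"bad span for {name!r}: {(start, end)}")
        share = sum(attn_row[start:end]) / total
        # Image segments usually contain many more tokens than text segments,
        # so the per-token mean is reported next to the raw share.
        stats[name] = {"share": share, "per_token": share / (end - start)}
    return stats


if __name__ == "__main__":
    # Toy row (made-up values): 2 system tokens, 4 image tokens, 2 question tokens.
    row = [0.2, 0.2, 0.05, 0.05, 0.05, 0.05, 0.2, 0.2]
    layout = {"system": (0, 2), "image": (2, 6), "question": (6, 8)}
    for name, s in attention_by_segment(row, layout).items():
        print(f"{name:9s} share={s['share']:.2f} per_token={s['per_token']:.3f}")
```

In practice the row would come from a model's attention output, averaged over heads or layers as the analysis requires; how the paper aggregates attention is not stated in this excerpt.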

Code & Models

Repositories

baai-dcai/multimodal-robustness-benchmark
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: Focus · Sparse Evolutionary Training