Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, Bo Zhao

TL;DR
This paper reveals that Multimodal Large Language Models often answer incorrectly despite understanding visuals, due to dataset biases and low visual attention, and proposes methods to improve their focus and accuracy.
Contribution
The paper introduces a new benchmark for error analysis in MLLMs and proposes techniques to enhance visual attention and diversify training data.
Findings
MLLMs attend notably less to visual tokens than to system and question tokens.
Most instruction tuning datasets feature questions that relate directly to the visual content, which biases MLLMs' responses to indirect questions.
Proposed methods improve MLLMs' focus on visual content and answer accuracy.
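
The attention finding above can be made concrete with a small sketch. The snippet below is an illustration only, not the paper's analysis code: the function name `attention_share_by_group`, the token layout, and the toy weights are all assumptions chosen for the example. It takes one row of attention weights (a single query position attending over all input positions) and reports, for each token group, its share of the total attention mass and its average weight per token.

```python
from typing import Dict, Sequence


def attention_share_by_group(
    attn_row: Sequence[float],
    groups: Dict[str, range],
) -> Dict[str, Dict[str, float]]:
    """Summarize how one query token's attention is split across token groups.

    attn_row: attention weights from a single query position over all key
              positions (assumed non-negative, e.g. softmax output averaged
              over heads).
    groups:   hypothetical mapping from a group name to the key positions it
              covers, e.g. {"system": range(0, 2), "visual": range(2, 8)}.
    """
    total = sum(attn_row)
    summary: Dict[str, Dict[str, float]] = {}
    for name, positions in groups.items():
        if len(positions) == 0:
            raise ValueError(f"group {name!r} has no positions")
        mass = sum(attn_row[i] for i in positions)
        summary[name] = {
            # Fraction of all attention mass that lands on this group.
            "share": mass / total,
            # Same fraction divided by group size, so long and short
            # groups can be compared token for token.
            "per_token": mass / (total * len(positions)),
        }
    return summary


# Toy layout (made-up numbers): 2 system, 6 visual, 2 question tokens.
toy_row = [0.20, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15]
toy_groups = {
    "system": range(0, 2),
    "visual": range(2, 8),
    "question": range(8, 10),
}
toy_summary = attention_share_by_group(toy_row, toy_groups)
for name, stats in toy_summary.items():
    print(f"{name}: share={stats['share']:.2f}, per_token={stats['per_token']:.3f}")
```

Reporting both numbers matters because image inputs usually expand into many more tokens than the question does, so a group can hold a sizable total share while each of its tokens still receives very little weight. With a real model, the row would come from the model's attention outputs at the answer-generating position rather than from hand-written values.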
Abstract
Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that 'directly' relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: Focus · Sparse Evolutionary Training
