Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

Pengfei Wang; Guohai Xu; Weinong Wang; Junjie Yang; Jie Lou; Yunhua Xue

arXiv:2505.10541·cs.CV·May 26, 2025

TL;DR

This paper investigates how multimodal large language models may appear to understand images correctly while actually lacking true visual comprehension, by analyzing attention mechanisms and proposing new evaluation metrics and benchmarks.

Contribution

The paper defines implicit visual misunderstanding (IVM), proposes attention accuracy, an attention-based metric of visual understanding, and builds a benchmark to quantify and analyze IVMs.

Findings

01. Attention increasingly converges on the image associated with the correct answer in deeper layers.
02. Attention accuracy is robust to positional biases.
03. The method extends effectively to unimodal scenarios.

Abstract

Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms,…
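The idea of scoring a model by where its attention lands, rather than by its answer, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that one attention vector per sample (from some answer token to all input positions) has already been extracted, that each image occupies a known half-open token span, and that a sample counts as a hit when the image receiving the largest share of image-directed attention is the one tied to the correct answer. The names `image_attention_shares`, `attention_accuracy`, and `samples` are hypothetical, and the numbers are made up for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) token range of one image (assumed known)
Sample = Tuple[Sequence[float], Sequence[Span], int]  # (attention row, image spans, target image index)


def image_attention_shares(attn: Sequence[float], spans: Sequence[Span]) -> List[float]:
    """Sum attention over each image's tokens, then normalise across images only.

    Text tokens are ignored, which is one simple way to separate the visual
    part of the attention row from the textual part (an assumption of this sketch).
    """
    masses = [sum(attn[start:end]) for start, end in spans]
    total = sum(masses)
    if total == 0:
        return [0.0 for _ in masses]
    return [mass / total for mass in masses]


def attention_accuracy(samples: Sequence[Sample]) -> float:
    """Fraction of samples whose most-attended image is the target image."""
    hits = 0
    for attn, spans, target in samples:
        shares = image_attention_shares(attn, spans)
        predicted = max(range(len(shares)), key=shares.__getitem__)
        hits += int(predicted == target)
    return hits / len(samples)


# Toy data: two images per sample, occupying tokens 1-2 and 3-4.
samples: List[Sample] = [
    # Most image attention falls on image 1, which is the target: a hit.
    ([0.05, 0.10, 0.10, 0.30, 0.35, 0.10], [(1, 3), (3, 5)], 1),
    # Most image attention falls on image 0, but the target is image 1: a miss.
    ([0.10, 0.40, 0.20, 0.10, 0.10, 0.10], [(1, 3), (3, 5)], 1),
]

print(attention_accuracy(samples))
```

Under these assumptions, a sample the model answers correctly while still counting as a miss here would be a candidate IVM. Repeating the computation with attention rows taken from different layers is one way to look for the convergence in deeper layers that the abstract describes.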

Code & Models

Repositories

welldonepf/stme (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need