How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

Guimeng Liu; Tianze Yu; Somayeh Ebrahimkhani; Lin Zhi Zheng Shawn; Kok Pin Ng; Ngai-Man Cheung

arXiv:2603.14323 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper investigates why medical multimodal large language models (MLLMs) underperform at medical image interpretation. It finds that they fail to ground their predictions in the relevant image regions, and it proposes a simple inference-time method that improves this grounding and achieves state-of-the-art results.

Contribution

It introduces VGMED, a new dataset for evaluating visual grounding in medical MLLMs, and proposes VGRefine, an inference-time technique that enhances grounding without extra training.

Findings

01

Medical MLLMs often fail to ground predictions in relevant image regions.

02

The proposed VGRefine method improves visual grounding performance.

03

These grounding failures are specific to medical images; they do not occur on natural scenes.
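
To make "grounding predictions in relevant image regions" concrete, here is a minimal sketch of one way such grounding could be scored: the share of a model's attention mass that falls inside an expert-annotated region. This is an illustrative assumption, not the metric defined in the paper; the function name `attention_in_region`, the list-of-lists attention map, and the binary mask format are all hypothetical.

```python
from typing import List


def attention_in_region(attn: List[List[float]], mask: List[List[int]]) -> float:
    """Return the fraction of total attention mass inside the annotated region.

    attn: 2D grid of non-negative attention weights (hypothetical format).
    mask: 2D grid of 0/1 flags marking the clinically relevant region.
    """
    # Both grids must have identical shapes to be compared cell by cell.
    if len(attn) != len(mask) or any(len(a) != len(m) for a, m in zip(attn, mask)):
        raise ValueError("attention map and mask must have the same shape")

    total = sum(sum(row) for row in attn)
    if total <= 0:
        # No attention mass at all: treat as no grounding.
        return 0.0

    # Sum only the attention weights whose mask cell is set.
    inside = sum(
        a
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
        if m
    )
    return inside / total


# Toy example: most attention lands on the single annotated cell.
attn = [[0.1, 0.1], [0.2, 0.6]]
mask = [[0, 0], [0, 1]]
score = attention_in_region(attn, mask)

# Uniform attention over the same mask gives a chance-level baseline.
uniform_score = attention_in_region([[0.25, 0.25], [0.25, 0.25]], mask)
```

Under this assumed definition, a well-grounded model would score well above the uniform baseline, while a score near that baseline would suggest the model is not attending to the annotated region.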

Abstract

Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education