Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs

Dung Nguyen; Minh Khoi Ho; Huy Ta; Thanh Tam Nguyen; Qi Chen; Kumar Rav; Quy Duong Dang; Satwik Ramchandre; Son Lam Phung; Zhibin Liao; Minh-Son To; Johan Verjans; Phi Le Nguyen; and Vu Minh Hieu Phan

arXiv:2505.00744 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces HEAL-MedVQA, a benchmark for evaluating localization and hallucination in medical multimodal models, and proposes the LobA framework to improve their grounded reasoning and answer reliability.

Contribution

The paper presents a new benchmark with two evaluation protocols and a dataset of 67K VQA pairs with doctor-annotated anatomical segmentation masks, along with a training framework (LobA) that enhances model grounding and reduces hallucinations.

Findings

01. LobA significantly improves localization accuracy.

02. Models trained with LobA produce more reliable, grounded answers.

03. HEAL-MedVQA effectively evaluates hallucination robustness in medical LMMs.

Abstract

Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs' localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for…
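
The abstract is cut off above and does not say how localization is scored against the doctor-annotated masks. Purely as an illustration, and assuming a standard intersection-over-union (IoU) comparison that may differ from the paper's actual metric, agreement between a predicted region and an annotated mask could be sketched as follows. The function name `mask_iou` and the toy masks are hypothetical and not taken from the paper.

```python
from typing import Sequence

# Binary mask as rows of 0/1 values: 1 marks a pixel inside the region.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-shaped binary masks.

    Assumption: IoU is used here only as a common localization score;
    the paper's own metric is not stated in the truncated abstract.
    """
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Two empty masks agree perfectly; this also avoids division by zero.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x3 example: the model's region vs. an annotated region.
predicted = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
annotated = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
score = mask_iou(predicted, annotated)  # 2 shared pixels out of 4 covered
```

A low score on a question the model nonetheless answers correctly would be one possible signal that the answer did not come from the relevant region, which is the kind of shortcut behavior the benchmark is described as probing.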
