Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

Anas Zafar; Leema Krishna Murali; Ashish Vashist

arXiv:2603.03437 · cs.CV · March 5, 2026

Open Access

TL;DR

This paper introduces a new evaluation framework for multimodal medical visual question answering, revealing that current training methods often produce ungrounded answers despite high accuracy.

Contribution

It proposes a counterfactual evaluation method and new metrics to assess visual grounding, exposing limitations of accuracy-focused training in medical VQA models.

Findings

01. RLVR improves accuracy but reduces visual grounding.
02. Models generate visual claims in 68-74% of responses.
03. Grounding-aware evaluation is necessary for progress.

Abstract

Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS) and Image Sensitivity (IS), and we introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall…
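
The abstract names the three metrics but this page does not give their formulas. The sketch below shows one plausible way to compute them from per-question answers under the real, blank, and shuffled image conditions. Every definition in it is an assumption made for illustration, not the paper's method: VRS is taken as accuracy on real images minus accuracy on shuffled images (which would be negative when a model does better with mismatched images), IS as the share of questions whose answer changes under either counterfactual, and HVRR as the share of responses that make a visual claim while the answer is image-invariant. The `Record` type and the `makes_visual_claim` flag are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One question evaluated under the three image conditions (hypothetical schema)."""
    gold: str
    ans_real: str
    ans_blank: str
    ans_shuffled: str
    makes_visual_claim: bool  # assumed flag: the response text refers to image content


def _norm(s: str) -> str:
    """Normalize an answer string for exact-match comparison."""
    return s.strip().lower()


def accuracy(records: List[Record], field: str) -> float:
    """Exact-match accuracy of the answers stored in `field`."""
    hits = sum(_norm(getattr(r, field)) == _norm(r.gold) for r in records)
    return hits / len(records)


def _is_invariant(r: Record) -> bool:
    """True when the answer is identical under real, blank, and shuffled images."""
    return _norm(r.ans_real) == _norm(r.ans_blank) == _norm(r.ans_shuffled)


def visual_reliance_score(records: List[Record]) -> float:
    # ASSUMED definition: accuracy(real) - accuracy(shuffled).
    return accuracy(records, "ans_real") - accuracy(records, "ans_shuffled")


def image_sensitivity(records: List[Record]) -> float:
    # ASSUMED definition: fraction of questions whose answer changes
    # under at least one counterfactual image.
    return sum(not _is_invariant(r) for r in records) / len(records)


def hallucinated_visual_reasoning_rate(records: List[Record]) -> float:
    # ASSUMED definition: fraction of responses that make a visual claim
    # even though the answer does not depend on the image.
    flagged = sum(r.makes_visual_claim and _is_invariant(r) for r in records)
    return flagged / len(records)


# Toy data, invented purely to exercise the functions.
toy = [
    Record("yes", "yes", "yes", "yes", True),      # invariant, visual claim
    Record("no", "no", "yes", "yes", True),        # image-sensitive
    Record("liver", "liver", "lung", "lung", False),  # image-sensitive
    Record("yes", "no", "no", "no", False),        # invariant, no visual claim
]
```

On the toy records, `visual_reliance_score(toy)` and `image_sensitivity(toy)` can be read side by side with `accuracy(toy, "ans_real")` to see how a model can score well on accuracy while half of its answers never change with the image.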

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)