Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning
Anas Zafar, Leema Krishna Murali, Ashish Vashist

TL;DR
This paper introduces a new evaluation framework for multimodal medical visual question answering, revealing that current training methods often produce ungrounded answers despite high accuracy.
Contribution
It proposes a counterfactual evaluation method and new metrics to assess visual grounding, exposing limitations of accuracy-focused training in medical VQA models.
Findings
RLVR improves accuracy but reduces visual grounding.
Models generate visual claims in 68-74% of responses.
Grounding-aware evaluation is necessary for progress.
Abstract
Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting that current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS) and Image Sensitivity (IS), and we introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall…
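The abstract names three grounding metrics but this page does not give their formulas. The sketch below shows one plausible way to compute them from predictions under the three image conditions (real, blank, shuffled). Every definition here is an assumption made for illustration and may differ from the paper's: VRS is taken as accuracy with real images minus accuracy with shuffled images, IS as the fraction of questions whose answer changes under either counterfactual image, and HVRR as the fraction of responses that make a visual claim while the answer stays the same across all three conditions. The `Record` type, its fields, and the `makes_visual_claim` flag are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One VQA question evaluated under three image conditions (hypothetical schema)."""
    gold: str                 # reference answer
    pred_real: str            # prediction with the real image
    pred_blank: str           # prediction with a blank image
    pred_shuffled: str        # prediction with a mismatched image
    makes_visual_claim: bool  # assumed flag: the response describes image content


def accuracy(records, field):
    """Fraction of records whose prediction in `field` matches the gold answer."""
    return sum(getattr(r, field) == r.gold for r in records) / len(records)


def visual_reliance_score(records):
    """Assumed definition: accuracy(real) - accuracy(shuffled).

    A negative value means the model scores higher with mismatched images.
    """
    return accuracy(records, "pred_real") - accuracy(records, "pred_shuffled")


def image_sensitivity(records):
    """Assumed definition: share of questions whose answer changes when the
    real image is replaced by a blank or shuffled one."""
    changed = sum(
        r.pred_real != r.pred_blank or r.pred_real != r.pred_shuffled
        for r in records
    )
    return changed / len(records)


def hallucinated_visual_reasoning_rate(records):
    """Assumed definition: share of responses that make a visual claim even
    though the answer is identical under all three image conditions."""
    hallucinated = sum(
        r.makes_visual_claim and r.pred_real == r.pred_blank == r.pred_shuffled
        for r in records
    )
    return hallucinated / len(records)


# Toy data, invented purely to exercise the functions.
toy = [
    Record("yes", "yes", "yes", "yes", True),   # image-invariant, with a visual claim
    Record("no", "no", "yes", "yes", True),     # answer changes with the image
    Record("yes", "no", "no", "yes", False),    # correct only with the shuffled image
    Record("no", "no", "no", "no", False),      # image-invariant, no visual claim
]

print("VRS :", visual_reliance_score(toy))
print("IS  :", image_sensitivity(toy))
print("HVRR:", hallucinated_visual_reasoning_rate(toy))
```

In practice the `makes_visual_claim` flag would have to come from some classifier or annotation step over the model's response; how the paper detects visual claims is not stated here.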
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
