Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Xingyue Wang; Bo Liu; Meng Wang; Zhixuan Zhang; Chengcheng Zhu; Huazhu Fu; Jiang Liu

arXiv:2605.22414·cs.CV·May 22, 2026

TL;DR

This paper introduces FundusGround, a new benchmark for ophthalmic VQA that emphasizes clinical interpretability through spatially-grounded lesion evidence, improving model transparency and reliability.

Contribution

It presents a dataset of 10,719 fundus images with 15,595 annotated lesions, each localized on the ETDRS grid, together with 72,706 questions built on that lesion evidence, and benchmarks models on both answer accuracy and lesion reasoning for clinical VQA.

Findings

01 Lesion-level evidence improves model performance.

02 Structured lesion annotations enable standardized mapping to retinal regions.

03 Models with lesion grounding show enhanced transparency.

Abstract

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, existing ophthalmic VQA benchmarks primarily emphasize answer accuracy and neglect the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions…
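To make the nine-region mapping concrete, the following is a minimal sketch of how a lesion position could be assigned to an ETDRS region. It is not the authors' pipeline. It assumes the standard ETDRS grid geometry (concentric circles of 1 mm, 3 mm, and 6 mm diameter centered on the fovea, with the two rings split into superior, inferior, nasal, and temporal quadrants), that lesion coordinates are already expressed in millimeters relative to the fovea with the y axis pointing up, and that the optic disc (nasal side) lies to the right in a right-eye image. The function name `etdrs_region` and the region labels are hypothetical.

```python
import math


def etdrs_region(x_mm, y_mm, eye="OD"):
    """Map a fovea-centered position (in mm) to one of nine ETDRS regions.

    Assumptions (not taken from the paper): +y is superior, and +x is nasal
    for a right eye ("OD") and temporal for a left eye ("OS").
    Returns None when the point lies outside the 6 mm grid.
    """
    r = math.hypot(x_mm, y_mm)
    if r <= 0.5:            # central subfield: 1 mm diameter
        return "central"
    if r > 3.0:             # beyond the outer ring: 6 mm diameter
        return None
    ring = "inner" if r <= 1.5 else "outer"   # inner ring: 1 to 3 mm diameter

    # Quadrant boundaries run along the diagonals (45, 135, 225, 315 degrees).
    angle = math.degrees(math.atan2(y_mm, x_mm)) % 360.0
    if 45.0 <= angle < 135.0:
        quadrant = "superior"
    elif 225.0 <= angle < 315.0:
        quadrant = "inferior"
    else:
        on_right = angle < 45.0 or angle >= 315.0
        nasal_on_right = (eye == "OD")
        quadrant = "nasal" if on_right == nasal_on_right else "temporal"
    return f"{ring}_{quadrant}"


if __name__ == "__main__":
    # A lesion 1 mm to the right of the fovea in a right-eye image.
    print(etdrs_region(1.0, 0.0, eye="OD"))
```

In practice, pixel coordinates would first need to be converted to millimeters using the image scale and a detected fovea location, both of which are outside the scope of this sketch.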
