Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu

TL;DR
This paper introduces FundusGround, a new benchmark for ophthalmic VQA that emphasizes clinical interpretability through spatially-grounded lesion evidence, improving model transparency and reliability.
Contribution
It presents a large annotated dataset (10,719 fundus images, 15,595 lesions localized on the ETDRS grid, and 72,706 questions) and benchmarks models on both answer accuracy and lesion reasoning for clinical VQA.
Findings
Lesion-level evidence improves model performance.
Structured lesion annotations on the ETDRS grid enable standardized mapping to nine clinically meaningful retinal regions.
Models with lesion grounding show enhanced transparency.
Abstract
Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 image-level meticulously annotated lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions…
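To make the ETDRS-based localization concrete, here is a minimal sketch of how a lesion point could be assigned to one of the nine grid regions. It is not the authors' pipeline. It assumes the standard ETDRS layout (a central subfield 1 mm in diameter, plus inner and outer rings of 3 mm and 6 mm diameter, each split into superior, nasal, inferior, and temporal quadrants), a known fovea position and pixel scale, and the usual display convention in which the nasal side lies to the right of the fovea in a right-eye image. The function name `etdrs_region`, its parameters, and the region labels are hypothetical.

```python
import math

# Standard ETDRS ring radii in millimetres (1, 3, and 6 mm diameters).
CENTER_R_MM, INNER_R_MM, OUTER_R_MM = 0.5, 1.5, 3.0


def etdrs_region(x, y, fovea_x, fovea_y, mm_per_pixel, eye="right"):
    """Return a label for the ETDRS region containing (x, y), or None if outside the grid.

    Hypothetical helper: image coordinates are assumed to have x growing
    rightward and y growing downward, with the grid centred on the fovea.
    """
    dx = (x - fovea_x) * mm_per_pixel
    dy = (fovea_y - y) * mm_per_pixel  # flip so positive dy means superior
    r = math.hypot(dx, dy)

    if r <= CENTER_R_MM:
        return "central"
    if r > OUTER_R_MM:
        return None

    ring = "inner" if r <= INNER_R_MM else "outer"
    angle = math.degrees(math.atan2(dy, dx)) % 360

    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        # Assumption: in a right-eye image the nasal side is right of the fovea,
        # and the layout is mirrored for a left eye.
        points_right = angle < 45 or angle >= 315
        nasal_is_right = eye == "right"
        quadrant = "nasal" if points_right == nasal_is_right else "temporal"

    return f"{ring}_{quadrant}"


if __name__ == "__main__":
    # Lesion 1 mm to the right of the fovea in a right-eye image.
    print(etdrs_region(200, 100, 100, 100, mm_per_pixel=0.01, eye="right"))
```

Passing `eye="left"` mirrors the horizontal quadrants, so the same pixel offset that is nasal in a right eye is reported as temporal in a left eye.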
