Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation
Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, and Andreas Maier

TL;DR
Retina-RAG is a modular vision-language framework that jointly diagnoses retinal diseases and generates clinical reports, using retrieved ophthalmic knowledge to improve diagnostic consistency and reduce hallucinations.
Contribution
It introduces a low-cost, flexible architecture combining retinal classification, vision-language modeling, and retrieval-augmented generation for comprehensive retinal analysis.
Findings
Achieves an F1-score of 0.731 for DR severity grading and 0.948 for ME detection
Attains a ROUGE-L of 0.438 and an SBERT similarity of 0.884 for report generation
Outperforms baseline models significantly on retinal disease detection
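For readers unfamiliar with the report-generation metric, ROUGE-L scores a generated report by the longest common subsequence (LCS) of tokens it shares with a reference report. The sketch below is a minimal illustration, assuming whitespace tokenization, lowercasing, and a balanced F1 combination of precision and recall; the paper's exact ROUGE-L configuration is not stated here, and the function names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F1 over LCS precision and recall.

    Assumption: whitespace tokenization and lowercasing only, with no
    stemming; published scores may use a different setup.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only, not taken from the paper's data.
    print(rouge_l_f1("mild dr no edema", "mild dr with edema"))
```

SBERT similarity, by contrast, compares sentence embeddings rather than token overlap, so the two metrics capture lexical and semantic agreement respectively.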
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier from a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially…
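The inference-time injection described in the abstract can be pictured as prompt assembly: structured classifier outputs and retrieved knowledge snippets are placed in the text prompt that accompanies the fundus image. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The five-level grade names, the `ClassifierOutput` fields, the token-overlap retriever, the prompt wording, and the three knowledge snippets are all hypothetical stand-ins; the paper's curated corpus and retrieval method are not described in this excerpt.

```python
from dataclasses import dataclass

# Assumption: a five-level DR severity scale; the paper's label set may differ.
DR_GRADES = ["no DR", "mild NPDR", "moderate NPDR", "severe NPDR", "proliferative DR"]

# Placeholder snippets standing in for a curated ophthalmic knowledge base.
KNOWLEDGE_BASE = [
    "Microaneurysms are the earliest clinically visible sign of diabetic retinopathy.",
    "Proliferative diabetic retinopathy is characterized by neovascularization.",
    "Macular edema involves retinal thickening or hard exudates near the macula.",
]


@dataclass
class ClassifierOutput:
    """Hypothetical structured output of the retinal classifier."""
    dr_grade: int          # index into DR_GRADES
    dr_confidence: float
    me_present: bool
    me_confidence: float


def tokenize(text):
    """Lowercase bag of words with basic punctuation stripped."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def retrieve(query, k=2):
    """Rank snippets by token overlap with the query (a toy retriever)."""
    q = tokenize(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(out, k=2):
    """Combine classifier findings and retrieved context into one text prompt."""
    grade = DR_GRADES[out.dr_grade]
    edema = "present" if out.me_present else "absent"
    findings = (
        f"DR grade: {grade} (confidence {out.dr_confidence:.2f}); "
        f"macular edema: {edema} (confidence {out.me_confidence:.2f})"
    )
    context = "\n".join(f"- {s}" for s in retrieve(f"{grade} macular edema", k))
    return (
        "Classifier findings:\n" + findings + "\n\n"
        "Reference knowledge:\n" + context + "\n\n"
        "Write a structured clinical report consistent with the findings above."
    )


if __name__ == "__main__":
    print(build_prompt(ClassifierOutput(4, 0.91, True, 0.88)))
```

In a full pipeline the resulting string would be passed to the vision-language model together with the image. Keeping the classifier, retriever, and prompt builder as separate functions mirrors the decoupling the abstract emphasizes, since any one of them can be swapped without touching the others.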
