Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation

Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, and Andreas Maier

arXiv:2605.06173 · cs.CV · May 11, 2026

TL;DR

Retina-RAG is a modular vision-language framework that jointly grades diabetic retinopathy, detects macular edema, and generates clinical reports, using retrieved ophthalmic knowledge to improve diagnostic consistency and reduce hallucinations.

Contribution

The paper introduces a low-cost, modular architecture that combines a retinal classifier, a LoRA-adapted vision-language model, and retrieval-augmented generation for joint DR grading, ME detection, and report generation.

Findings

01. Achieves an F1-score of 0.731 for DR severity grading

02. Attains a ROUGE-L of 0.438 and an SBERT similarity of 0.884 for report generation

03. Substantially outperforms baseline models on retinal disease detection, with an F1-score of 0.948 for ME detection
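For readers unfamiliar with the report-generation metric, the following is a minimal sketch of how a ROUGE-L score can be computed from the longest common subsequence (LCS) of tokens. It assumes lowercase whitespace tokenization and a plain F1 combination of precision and recall; the paper's exact ROUGE implementation (tokenizer, stemming, weighting) is not stated on this page, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A generated report would be scored against its reference report with `rouge_l_f1(reference, generated)`, and the per-report scores averaged over the test set.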

Abstract

Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially…
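The inference-time step described in the abstract, where retrieved knowledge is injected together with structured classifier outputs, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the token-overlap retriever, the `ClassifierOutput` fields, the prompt wording, and the placeholder corpus are all hypothetical, and the abstract does not say which retriever or prompt template Retina-RAG uses.

```python
import re
from dataclasses import dataclass

# Placeholder corpus; the paper's curated ophthalmic knowledge is not shown here.
CORPUS = [
    "Placeholder note on DR grade criteria.",
    "Placeholder note on macular edema.",
    "Placeholder note on unrelated topic.",
]


@dataclass
class ClassifierOutput:
    """Hypothetical structured output of the retinal classifier."""
    dr_grade: str
    me_present: bool


def tokenize(text):
    """Lowercase alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, corpus, k=2):
    """Rank snippets by token overlap with the query (stand-in retriever)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(out, corpus, k=2):
    """Combine classifier findings and retrieved notes into one text prompt."""
    edema = "present" if out.me_present else "absent"
    findings = f"DR grade: {out.dr_grade}; macular edema: {edema}"
    lines = ["Classifier findings:", findings, "Reference notes:"]
    lines += [f"- {note}" for note in retrieve(findings, corpus, k)]
    lines.append("Write a structured clinical report consistent with the findings above.")
    return "\n".join(lines)
```

In a full pipeline, the string returned by `build_prompt(ClassifierOutput("moderate", True), CORPUS)` would be passed to the vision-language model alongside the fundus image, so the generated report is conditioned on both the classifier's grade and the retrieved notes.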
