From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models
Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildirim Yayilgan, Sarang Shaikh

TL;DR
This study develops a deep learning framework for diabetic retinopathy (DR) grading that pairs CNN-transformer ensembles with visual and textual explanations, so that predicted grades are easier to interpret in a clinical setting.
Contribution
The paper introduces a multimodal explanation approach in which vision-language models, conditioned on the retinal image and the classifier output, generate textual rationales for the predicted DR grade.
Findings
Ensembling with weighted soft voting achieved the highest agreement, with a quadratic weighted kappa (QWK) of 0.934.
CNN backbones such as ResNet-50 and ConvNeXt-Tiny provided strong baseline performance.
VLM rationales were grade-consistent, offering plausible explanations.
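To make the first finding concrete, the following is a minimal sketch of weighted soft voting and of the quadratic weighted kappa metric, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy probabilities, and the model weights are all hypothetical, and the summary does not say how the paper chose its weights.

```python
from typing import List, Sequence


def weighted_soft_vote(
    probs: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[int]:
    """Fuse per-model class probabilities with a weighted average.

    probs[m][i][c] is the probability model m assigns to class c for sample i.
    weights[m] is the (hypothetical) weight given to model m.
    Returns the arg-max class of the fused distribution for each sample.
    """
    total = sum(weights)
    n_samples = len(probs[0])
    n_classes = len(probs[0][0])
    preds = []
    for i in range(n_samples):
        fused = [
            sum(w * p[i][c] for w, p in zip(weights, probs)) / total
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=fused.__getitem__))
    return preds


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], n_classes: int
) -> float:
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [
        sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    if den == 0.0:
        # Degenerate case: every label and prediction is the same single class.
        return 1.0
    return 1.0 - num / den


# Toy example: two hypothetical models, two samples, three classes.
toy_probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # model A
    [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],  # model B
]
fused_preds = weighted_soft_vote(toy_probs, weights=[1.0, 3.0])
```

Because the quadratic weights grow with the distance between the true and predicted grade, QWK penalizes a two-grade error more heavily than a one-grade error, which is why it suits ordinal labels such as DR severity.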
Abstract
The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and…
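The stratified five-fold protocol mentioned in the abstract can be sketched as follows. This is a simplified illustration in plain Python, assuming integer grade labels; the function name `stratified_folds`, the seed, and the round-robin assignment are choices made here for clarity and are not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(
    labels: Sequence[int], n_folds: int = 5, seed: int = 0
) -> List[List[int]]:
    """Split sample indices into n_folds folds with similar class proportions.

    Indices of each class are shuffled, then dealt round-robin across folds,
    so every fold receives roughly 1/n_folds of each class.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy labels: an imbalanced three-class set (hypothetical data).
toy_labels = [0] * 10 + [1] * 5 + [2] * 5
toy_folds = stratified_folds(toy_labels, n_folds=5)

# Each fold serves once as the validation set; the rest form the training set.
for k, val_idx in enumerate(toy_folds):
    train_idx = [i for j, fold in enumerate(toy_folds) if j != k for i in fold]
```

Stratification matters for DR grading because severe grades are typically rare; without it, a fold could contain very few examples of a grade and make per-fold metrics unstable. This round-robin sketch always hands leftover samples to the earliest folds, so fold sizes can differ slightly when class counts are not multiples of the fold count.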
