Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

Yiqun Wang (Beijing Jiaotong University)

arXiv:2605.09242·eess.IV·May 12, 2026

TL;DR

This paper introduces a novel diffusion-based framework for diabetic retinopathy grading that leverages vision-language models and semantic conditioning to improve accuracy and robustness.

Contribution

It proposes a CLIP-guided semantic diffusion model with domain adaptation and cross-modal semantic conditioning for enhanced DR grading performance.

Findings

01 Achieved 87.5% accuracy on the APTOS 2019 dataset.

02 Outperformed existing diffusion-based and visual-only methods.

03 Validated the effectiveness of each module through ablation studies.

Abstract

Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a…
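To make the "minimal number of trainable parameters" claim concrete, here is a minimal sketch of the general LoRA idea: a frozen weight matrix plus a trainable low-rank update. It is not the paper's implementation. The class name `LoRALinear`, the layer size (512 x 512), the rank `r=4`, and the scaling `alpha=8.0` are all assumptions chosen for illustration, and the "pretrained" weights are random placeholders.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Hypothetical layer: frozen W plus trainable update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random here, purely for illustration).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is small random, B starts at zero,
        # so the adapted layer initially matches the frozen one.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])

layer = LoRALinear(d_in=512, d_out=512, r=4)
x = [1.0] * 512
y = layer.forward(x)
print(layer.trainable_params(), layer.frozen_params())  # 4096 262144
```

Under these assumed sizes, only 4,096 of the layer's values would be trained while 262,144 stay frozen, about 1.6% of the frozen count. Because `B` starts at zero, the adapted layer reproduces the frozen layer's output before any training.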
