Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra

TL;DR
This paper introduces a knowledge-enhanced multimodal transformer framework that integrates retinal images, clinical text, and patient data to improve cross-modal retrieval and diagnosis of diabetic retinopathy, surpassing existing models.
Contribution
It presents a novel multimodal transformer architecture with knowledge integration for medical image-text alignment, achieving state-of-the-art results in diabetic retinopathy diagnosis and retrieval.
Findings
Near-perfect text-to-image retrieval with 99.94% Recall@1
Achieves 97.05% accuracy in SDRG classification
Demonstrates strong zero-shot generalization on unseen datasets
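For readers unfamiliar with the retrieval metric cited above, the following is a minimal sketch of how text-to-image Recall@K is commonly computed from paired embeddings. It is not the authors' evaluation code. The function name `recall_at_k`, the use of cosine similarity, and the assumption that row i of the text embeddings matches row i of the image embeddings are all illustrative choices.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=1):
    """Fraction of text queries whose paired image ranks in the top k.

    Assumption (illustrative): row i of text_emb is the caption for
    row i of image_emb, and similarity is cosine similarity.
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    # sim[i, j] = similarity between text query i and image j.
    sim = text_emb @ image_emb.T

    # Indices of the k most similar images for each text query.
    topk = np.argsort(-sim, axis=1)[:, :k]

    # A query counts as a hit if its own index appears among its top k.
    targets = np.arange(sim.shape[0])[:, None]
    hits = (topk == targets).any(axis=1)
    return float(hits.mean())
```

Under this definition, a Recall@1 of 99.94% would mean that for nearly every text query the correct image is the single highest-ranked candidate.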
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a…
Taxonomy
Topics: Retinal Imaging and Analysis · Retinal Diseases and Treatments · Multimodal Machine Learning Applications
