Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

Argha Kamal Samanta; Harshika Goyal; Vasudha Joshi; Tushar Mungle; Pabitra Mitra

arXiv:2512.19663 · cs.CV · December 23, 2025

TL;DR

This paper introduces a knowledge-enhanced multimodal transformer framework that integrates retinal images, clinical text, and patient data to improve cross-modal retrieval and diagnosis of diabetic retinopathy, surpassing existing models.

Contribution

It presents a novel multimodal transformer architecture with knowledge integration for medical image-text alignment, achieving state-of-the-art results in diabetic retinopathy diagnosis and retrieval.

Findings

01. Near-perfect text-to-image retrieval: 99.94% Recall@1

02. 97.05% accuracy in SDRG classification

03. Strong zero-shot generalization on unseen datasets
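For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of how text-to-image Recall@K is commonly computed over a joint embedding space. It is illustrative only and not the authors' evaluation code: the function names `cosine` and `recall_at_k`, the toy two-dimensional embeddings, and the assumption that text query i is paired with image i are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(text_embs, image_embs, k=1):
    """Fraction of text queries whose paired image (same index) is in the top k.

    Assumes text_embs[i] and image_embs[i] form a ground-truth pair.
    """
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [cosine(query, img) for img in image_embs]
        # Rank image indices from most to least similar to the text query.
        ranked = sorted(range(len(image_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (made up for illustration, not data from the paper).
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(text_embs, image_embs, k):.4f}")
```

Under this definition, a Recall@1 of 99.94% would mean that for nearly every text query the correct image is the single most similar one in the gallery.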

Abstract

Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a…

Taxonomy

Topics: Retinal Imaging and Analysis · Retinal Diseases and Treatments · Multimodal Machine Learning Applications