ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, and Ningli Wang

arXiv:2408.10894 · cs.CV · September 23, 2025

TL;DR

ViLReF is a vision-language retinal foundation model that uses expert knowledge for label extraction and a new loss function, the Weighted Similarity Coupling Loss, to improve zero-shot and transfer learning in retinal image analysis.

Contribution

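The paper introduces a retinal foundation model pre-trained on 451,956 retinal images paired with diagnostic text reports. Expert knowledge guides label extraction, and a Weighted Similarity Coupling Loss dynamically adjusts how fast sample pairs are pushed apart in the feature space, which limits the damage done by false negatives (image-text pairs with the same semantics that are wrongly treated as negatives).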

Findings

01 Demonstrates strong zero-shot performance on retinal classification tasks.

02 Shows effective transfer learning capabilities across multiple datasets.

03 Validates the proposed pre-training strategy's superiority over existing methods.

Abstract

Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum…
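The false-negative problem described in the abstract can be illustrated with a small sketch. The code below is not the paper's Weighted Similarity Coupling Loss, whose exact form is not given here. It is a simplified, hypothetical stand-in: an InfoNCE-style image-to-text loss in which each negative pair is weighted by how different its expert-extracted label sets are, so pairs that share labels are pushed apart less. The function names, the Jaccard-based weight, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def jaccard(a, b):
    """Jaccard overlap of two label sets (assumption: two empty sets count as identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_contrastive_loss(img_emb, txt_emb, labels, temperature=0.1):
    """Hypothetical label-weighted image-to-text contrastive loss.

    img_emb, txt_emb: lists of embedding vectors; row i of each forms a true pair.
    labels: list of sets of labels extracted from each report.
    A negative pair (i, j) gets weight 1 - jaccard(labels[i], labels[j]),
    so a pair with identical labels contributes nothing to the denominator.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(img_emb[i], txt_emb[i]) / temperature)
        neg = 0.0
        for j in range(n):
            if j == i:
                continue
            weight = 1.0 - jaccard(labels[i], labels[j])
            neg += weight * math.exp(cosine(img_emb[i], txt_emb[j]) / temperature)
        total += -math.log(pos / (pos + neg))
    return total / n
```

With disjoint label sets every weight is 1 and the function reduces to an ordinary one-directional InfoNCE loss. With identical label sets the negatives drop out and the loss is zero, which is the behavior one would want for pairs that are negatives in name only.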

Code & Models

Repositories

t6yang/vilref (PyTorch, official)

Taxonomy

Topics: Retinal Imaging and Analysis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings