ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model
Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, and Ningli Wang

TL;DR
ViLReF is a vision-language retinal foundation model that uses expert knowledge for label extraction and a new loss function to improve zero-shot and transfer learning in retinal image analysis.
Contribution
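The paper introduces a retinal foundation model pre-trained on 451,956 retinal images paired with diagnostic text reports. Expert knowledge guides label extraction, and a Weighted Similarity Coupling Loss dynamically adjusts how fast sample pairs are pushed apart in feature space, which limits the damage done by false negatives.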
Findings
Demonstrates strong zero-shot performance on retinal classification tasks.
Shows effective transfer learning capabilities across multiple datasets.
Validates the proposed pre-training strategy's superiority over existing methods.
Abstract
Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum…
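To make the false-negative problem concrete, here is a minimal sketch of a contrastive loss in which each negative pair is down-weighted by how much its labels overlap. It is not the paper's Weighted Similarity Coupling Loss, whose exact form is not given in this summary. The function name, the Jaccard-based weight, and the example labels are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def jaccard(a, b):
    """Label overlap in [0, 1]; two empty label sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def label_weighted_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Image-to-text InfoNCE with label-aware negative weights (hypothetical).

    img_emb, txt_emb: lists of equal-length vectors; row i of each is a pair.
    labels: one set of report-derived labels per pair.
    A negative pair (i, j) gets weight 1 - jaccard(labels[i], labels[j]),
    so pairs that share all labels (likely false negatives) are not pushed
    apart, and pairs that share some labels are pushed apart more gently.
    """
    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    n = len(img)
    total = 0.0
    for i in range(n):
        pos = math.exp(dot(img[i], txt[i]) / temperature)
        neg = 0.0
        for j in range(n):
            if j == i:
                continue
            weight = 1.0 - jaccard(labels[i], labels[j])
            neg += weight * math.exp(dot(img[i], txt[j]) / temperature)
        total += -math.log(pos / (pos + neg))
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    same = label_weighted_contrastive_loss(
        images, texts, [{"DR"}, {"DR"}], temperature=1.0)
    distinct = label_weighted_contrastive_loss(
        images, texts, [{"DR"}, {"glaucoma"}], temperature=1.0)
    print(same, distinct)
```

With all weights equal to 1 this reduces to the standard image-to-text InfoNCE loss. The weighting only changes how strongly semantically similar pairs repel each other, which is the behaviour the abstract attributes to its loss in general terms.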
Taxonomy
Topics: Retinal Imaging and Analysis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
