MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

Ruiqi Wu; Na Su; Chenran Zhang; Tengfei Ma; Tao Zhou; Zhiting Cui; Nianfeng Tang; Tianyu Mao; Yi Zhou; Wen Fan; Tianxing Wu; Shenqi Jing; Huazhu Fu

arXiv:2501.15798·cs.CV·August 26, 2025·2 cites

TL;DR

This paper introduces MM-Retinal V2, a high-quality fundus image-text dataset, and KeepFIT V2, a novel vision-language pretraining model that effectively transfers elite knowledge into public datasets, enhancing fundus image analysis.

Contribution

The work presents a new dataset and a pretraining method that integrates knowledge transfer techniques, improving fundus vision-language models without relying on large private datasets.

Findings

01 Achieves performance competitive with state-of-the-art models

02 Demonstrates strong generalization in zero-shot and few-shot tasks

03 Provides a publicly available dataset and model for research

Abstract

Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they significantly rely on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancements. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, which is…
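
Vision-language pretraining of this kind is commonly built on a contrastive image-text objective, and contrastive learning is listed among this paper's methods. The sketch below shows a generic CLIP-style symmetric contrastive loss over a batch of matched image-text embedding pairs. It is not KeepFIT V2's actual objective, which the truncated abstract does not specify; the function names, the pure-Python list representation, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss (an assumption, not the paper's loss).

    image_embs[k] and text_embs[k] are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: row k should assign most probability to column k.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: the same computation on the transposed matrix.
    transposed = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(transposed[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")
```

In this toy setting the loss is near zero when each image embedding matches its own text embedding and large when the pairs are swapped, which is the behavior a contrastive pretraining objective is meant to reward.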

Code & Models

Repositories

lxirich/mm-retinal (PyTorch, official)

Taxonomy

Topics: Text Readability and Simplification

Methods: Softmax · Attention Is All You Need · Contrastive Learning