TL;DR
Fundus-R1 is a knowledge-aware, reasoning-enhanced multimodal large language model trained solely on public datasets for improved fundus image understanding and diagnosis.
Contribution
It introduces a RAG-based reasoning trace generation method and enhances RLVR with process rewards, enabling effective training on publicly available data.
Findings
Fundus-R1 outperforms baseline models on three fundus-reading benchmarks.
The RAG-based reasoning traces improve the model's interpretability and accuracy.
Self-consistency rewards enhance the reasoning quality during training.
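As a rough illustration of how a verifiable outcome reward might be combined with a self-consistency process reward, here is a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the `<think>`/`<answer>` output format, the function names, the string-matching consistency check, and the `weight` parameter are all hypothetical.

```python
import re

# Hypothetical R1-style response format; the source does not specify one.
_PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def parse_output(text):
    """Split a response into (reasoning trace, lowercased answer), or None if malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip().lower()


def outcome_reward(answer, gold_label):
    """Verifiable reward: 1.0 if the final answer equals the image-level label."""
    return 1.0 if answer == gold_label.lower() else 0.0


def consistency_reward(trace, answer):
    """Crude process reward (assumption): 1.0 if the trace mentions the final answer."""
    return 1.0 if answer and answer in trace.lower() else 0.0


def total_reward(text, gold_label, weight=0.5):
    """Outcome reward plus a weighted consistency bonus; malformed output scores 0."""
    parsed = parse_output(text)
    if parsed is None:
        return 0.0
    trace, answer = parsed
    return outcome_reward(answer, gold_label) + weight * consistency_reward(trace, answer)


consistent = (
    "<think>Microaneurysms and hard exudates suggest diabetic retinopathy.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
inconsistent = "<think>The optic disc looks normal.</think><answer>glaucoma</answer>"

print(total_reward(consistent, "diabetic retinopathy"))
print(total_reward(inconsistent, "glaucoma"))
```

In this toy setup a correct answer backed by a trace that names the same diagnosis scores higher than a correct answer whose trace never mentions it. A real process reward would need a far more robust consistency check than substring matching.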
Abstract
Fundus imaging, such as color fundus photography (CFP), optical coherence tomography (OCT) and ultra-widefield (UWF) imaging, is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level…
