Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning

Xinyao Liu; Diping Song

arXiv:2507.17539 · cs.AI · July 24, 2025

TL;DR

This paper presents FundusExpert, an ophthalmology-specific multimodal large language model (MLLM) with integrated reasoning capabilities. Trained on a new dataset, FundusGen, it achieves superior performance on clinical ophthalmic tasks, and the results underline the importance of data quality and cognitive alignment.

Contribution

Introduction of FundusExpert, a specialized MLLM for ophthalmology, and FundusGen, a dataset with clinical reasoning annotations, advancing cross-modal understanding and interpretability in medical AI.

Findings

1. FundusExpert exceeds the accuracy of the 40B-parameter MedRegA by 26.6%.

2. It achieves 77.0% clinical consistency in zero-shot report generation.

3. It demonstrates a data quality-capability scaling law ($L \,\propto\, N^{0.068}$).
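A quick worked reading of the exponent may help, taking the proportionality at face value. This page does not define $L$ and $N$, so the arithmetic below simply treats $N$ as the quantity being scaled and $L$ as the quantity that responds; the paper itself should be consulted for their exact meaning. For any scale factor $k$:

$$
\frac{L(kN)}{L(N)} = k^{0.068}, \qquad
2^{0.068} = e^{0.068 \ln 2} \approx e^{0.0471} \approx 1.048, \qquad
10^{0.068} = e^{0.068 \ln 10} \approx e^{0.1566} \approx 1.17
$$

Under this reading, doubling $N$ changes $L$ by roughly 4.8%, and a tenfold increase changes it by roughly 17%.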

Abstract

Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable…
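To make the three annotation granularities and the positioning-then-diagnosis order more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Finding` and `FundusRecord` classes, their fields, the `quadrant` helper and the `cognitive_chain` text format are illustrative assumptions, not the released FundusGen schema or the authors' Fundus-Engine code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# HYPOTHETICAL schema: field names are illustrative, not the FundusGen format.
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Finding:
    """Local object detection (box) plus fine-grained feature text."""
    label: str
    box: Box
    description: str


@dataclass
class FundusRecord:
    """One fundus image annotated at three granularities."""
    image_id: str
    diagnoses: List[str]  # global disease classification labels
    findings: List[Finding] = field(default_factory=list)


def quadrant(box: Box, width: int, height: int) -> str:
    """Coarse image quadrant of the box centre (image coordinates, y grows downward)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical}-{horizontal}"


def cognitive_chain(record: FundusRecord, width: int, height: int) -> str:
    """Render a trace that localizes every finding before stating the diagnosis."""
    steps = [
        f"Step {i} (locate): {f.label} at {f.box}, "
        f"{quadrant(f.box, width, height)} quadrant; {f.description}."
        for i, f in enumerate(record.findings, start=1)
    ]
    steps.append(
        f"Step {len(record.findings) + 1} (diagnose): "
        + ", ".join(record.diagnoses) + "."
    )
    return "\n".join(steps)


# Example with made-up values.
record = FundusRecord(
    image_id="example_001",
    diagnoses=["diabetic retinopathy"],
    findings=[Finding("hard exudate", (300, 120, 340, 150),
                      "yellowish deposits with sharp margins")],
)
print(cognitive_chain(record, width=1024, height=1024))
```

Placing every locate step before the single diagnose step mirrors the positioning-diagnosis ordering the abstract describes; a real pipeline would presumably also need the MLLM-based semantic expansion stage, which this sketch does not attempt.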

Code & Models

Datasets

MeteorElf/Fundus-MMBench
