LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen

TL;DR
This paper introduces LMOD, a comprehensive ophthalmology dataset and benchmark for evaluating large vision-language models, revealing significant performance gaps and failure modes in current models compared to supervised methods.
Contribution
The paper presents LMOD, a large-scale multimodal ophthalmology benchmark, and evaluates 13 LVLMs, highlighting their limitations and the need for specialized ophthalmology models.
Findings
LVLMs perform significantly worse in ophthalmology tasks.
Six major failure modes identified in LVLMs.
Supervised models outperform LVLMs in accuracy.
Abstract
The prevalence of vision-threatening eye diseases is a significant global burden, with many cases remaining undiagnosed or diagnosed too late for effective treatment. Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans, thereby reducing the burden on clinicians and improving access to eye care. However, limited benchmarks are available to assess LVLMs' performance in ophthalmology-specific applications. In this study, we introduce LMOD, a large-scale multimodal ophthalmology benchmark consisting of 21,993 instances across (1) five ophthalmic imaging modalities: optical coherence tomography, color fundus photographs, scanning laser ophthalmoscopy, lens photographs, and surgical scenes; (2) free-text, demographic, and disease biomarker information; and (3)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Biomedical Text Mining and Ontologies
