# How Far Have Large Language Models Advanced in Ophthalmology? A Systematic Review of Their Development, Evaluation, and Readiness for Clinical Use

**Authors:** Hyunjae Kim, Yu Yin, Zhiyuan Cao, Chen Liu, Anran Li, Zhen Chen, Xuguang Ai, Younjoon Chung, Fan Ma, Xueping Peng, Lingfei Qian, Zhenyue Qin, Kalpana Raja, Yang Ren, Weipeng Zhou, Yih-Chung Tham, Emily Y. Chew, Zhiyong Lu, Sophia Y. Wang, Hua Xu, Qingyu Chen

PMC · DOI: 10.21203/rs.3.rs-8819770/v1 · Research Square · 2026-02-10

## TL;DR

This paper reviews how large language models are being used in ophthalmology, highlighting their current applications, evaluation methods, and readiness for real-world clinical use.

## Contribution

The study introduces a structured framework to categorize and evaluate LLMs in ophthalmology, revealing gaps in clinical validation and reproducibility.

## Key findings

- Most studies use general-purpose models like GPT-4 and Gemini, with few domain-specific adaptations.
- Multimodal LLMs integrating imaging data are underexplored, and evaluation often lacks real-world clinical validation.
- Only a small fraction of studies provide reproducible results or use publicly available datasets.

## Abstract

Large language models (LLMs) are rapidly transforming ophthalmology, with expanding applications in patient care, clinical documentation, and medical education. Recent studies span a wide range of use cases, from early text-only applications to emerging multimodal systems that integrate ophthalmic images to support diagnosis and generate assessment and treatment plans. Amid this rapid progress, it is critical for both researchers and clinicians to stay informed in order to guide responsible development and adoption. However, prior reviews have largely focused on narrow topics, such as inventories of potential use cases or performance on board-style examinations, leaving the broader landscape insufficiently characterized. Key questions remain unanswered: How are LLMs in ophthalmology being developed? What applications and evaluation strategies are being pursued? And which areas are closest to real-world clinical adoption? To date, these aspects have not been comprehensively examined.

In this study, we conducted a systematic review of LLMs in ophthalmology by manually screening 1,029 studies from PubMed/PMC, Scopus, and Embase published between January 1, 2022, and April 1, 2025, identifying 91 relevant articles. To provide a standardized assessment, we introduced a structured framework that categorizes ophthalmic use cases and stratifies evaluation rigor across five levels of maturity. Each study was manually annotated using 27 structured variables spanning multiple dimensions: scope and purpose (e.g., study aim, ophthalmic subspecialty, input modality); model architecture and training (e.g., backbone LLMs, domain-specific adaptations); evaluation and validation (e.g., target applications, evaluation metrics, level of clinical validation); and resource availability (e.g., model access, licensing, dataset availability). We additionally performed a small-scale, illustrative evaluation of representative emerging models, such as GPT-5.2, gpt-oss-120B, and Gemini 3, to contextualize previously reported results on commonly used ophthalmology tasks.
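The annotation scheme described above can be pictured as one structured record per study. The following is a minimal sketch, not the authors' actual schema: it includes only the example variables the abstract names (not all 27), and the field names, the example values, and the `share` helper are hypothetical. The abstract does not name the five maturity levels, so they are represented here only as integers 1 to 5.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StudyAnnotation:
    """One annotated study; field names are illustrative assumptions."""
    # Scope and purpose
    study_aim: str
    subspecialty: str
    input_modality: str  # e.g. "text" or "text+image" (hypothetical values)
    # Model architecture and training
    backbone_llms: List[str] = field(default_factory=list)
    domain_adapted: bool = False
    # Evaluation and validation
    target_application: str = ""
    evaluation_metrics: List[str] = field(default_factory=list)
    validation_level: int = 1  # 1..5; level names are not given in the abstract
    # Resource availability
    model_access: str = ""
    license: str = ""
    dataset_public: bool = False

    def __post_init__(self) -> None:
        # Reject levels outside the five-level scale.
        if not 1 <= self.validation_level <= 5:
            raise ValueError("validation_level must be between 1 and 5")


def share(records: List[StudyAnnotation],
          predicate: Callable[[StudyAnnotation], bool]) -> float:
    """Percentage of records satisfying `predicate`, rounded to one decimal."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if predicate(r))
    return round(100 * hits / len(records), 1)
```

With records in this shape, a summary statistic such as the proportion of studies using public datasets reduces to `share(records, lambda r: r.dataset_public)`.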

The results show that most studies focused on general-purpose proprietary models, such as GPT-4 and Gemini, while fewer than 10% introduced domain-specific adaptations for ophthalmology, including only 4% that developed ophthalmology-specific architectures for text-based applications. Multimodal LLMs remain relatively underexplored, with only 23% of studies incorporating imaging data. Evaluation practices reveal a significant translational gap: while 57.1% of studies relied on standard benchmarking and expert review, only 9.9% conducted retrospective validation using real-world clinical data, and just two studies progressed to prospective pilot evaluation. Moreover, although model performance on board-style exams and clinical vignettes has improved with newer model generations, reproducibility and transparency remain limited: only 5.5% of studies released evaluation code, and 33% used publicly available datasets. Finally, we provide a living repository to track the rapid progress of LLMs in ophthalmology for the broader research and clinical community.
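To give the reported percentages a sense of scale, the sketch below back-calculates how many studies each one would imply. This is an illustration only: it assumes every percentage uses all 91 included studies as its denominator, which the abstract does not state, and the function name `implied_counts` is hypothetical. The resulting counts are inferences, not figures reported by the authors.

```python
TOTAL_STUDIES = 91  # number of included articles, as stated in the abstract


def implied_counts(pct: float, decimals: int, total: int = TOTAL_STUDIES) -> list:
    """Return every integer count k for which k/total rounds to `pct`.

    `decimals` is the precision at which the percentage was reported
    (0 for "23%", 1 for "57.1%").
    """
    return [
        k for k in range(total + 1)
        if abs(round(100 * k / total, decimals) - pct) < 1e-9
    ]


# Percentages as reported in the abstract: (value, reported decimals).
reported = {
    "incorporated imaging data": (23, 0),
    "benchmarking and expert review": (57.1, 1),
    "retrospective validation": (9.9, 1),
    "released evaluation code": (5.5, 1),
    "used public datasets": (33, 0),
}

for label, (pct, decimals) in reported.items():
    print(f"{label}: {pct}% -> candidate counts {implied_counts(pct, decimals)}")
```

Under that assumption, 57.1% would correspond to 52 of 91 studies and 9.9% to 9 of 91, which makes the gap between benchmark-style evaluation and retrospective clinical validation concrete.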

## Full-text entities

- **Diseases:** vision impairment (MESH:D014786), AMD (MESH:D008268), LLMs (MESH:D007806), hallucinations (MESH:D006212), DR (MESH:D004370), Diabetic retinopathy (MESH:D003930), glaucoma (MESH:D005901), eye diseases (MESH:D005128)
- **Chemicals:** FFA (MESH:D005230), GPT-4V (-), fluorescein (MESH:D019793)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12919167/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12919167/full.md

## References

123 references — full list in the complete paper: https://tomesphere.com/paper/PMC12919167/full.md

---
Source: https://tomesphere.com/paper/PMC12919167