# Evolving Consultation: Enhancing Ophthalmic Diagnostic Performance Using Large Language Model

**Authors:** Taiga Inooka, Hikaru Ota, Yosuke Taki, Sayuri Yasuda, Ai Fujita Sajiki, Ayana Suzumura, Hideyuki Shimizu, Jun Takeuchi, Ryo Tomita, Taro Kominami, Hiroaki Ushida, Kenya Yuki, Koji M. Nishiguchi

PMC · DOI: 10.1016/j.xops.2025.101004 · Ophthalmology Science · 2025-11-11

## TL;DR

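This study shows that ChatGPT-4o assistance improves the coherency, comprehensiveness, and safety of ophthalmologists' written diagnostic responses, particularly among residents. It does not improve factuality, and approximately 44% of the citations added with its help were hallucinated, nonexistent, or incorrect.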

## Contribution

The study evaluates how ChatGPT-4o assistance affects the quality and variability of ophthalmologists' clinical reasoning at different levels of experience, a question the authors describe as lacking prior study in ophthalmology.

## Key findings

- ChatGPT-4o improved coherency, comprehensiveness, and safety scores for both residents and board-certified ophthalmologists.
- ChatGPT-4o increased citation frequency, but approximately 44% of the additional citations were hallucinated, nonexistent, or incorrect.
- Factuality scores did not improve significantly, and variability in both factuality and safety scores increased after ChatGPT-4o assistance.

## Abstract

Artificial intelligence–powered large language models (LLMs) are increasingly applied in health care. However, studies in ophthalmology assessing whether LLMs can improve the accuracy of complex differential diagnoses in clinical cases, or which levels of clinical experience benefit most from their use, remain lacking. This study assessed the effectiveness of ChatGPT-4o, an LLM-driven chatbot, in enhancing ophthalmologists' clinical reasoning using original scenarios.

This was a prospective study.

Ten original ophthalmic clinical scenarios with open-ended questions were developed, covering the following subspecialties: oculoplastic and orbital disease, glaucoma, inherited retinal disease, macular disease, neuro-ophthalmology, ocular surface, pediatric ophthalmology, retinal vascular disease, strabismus, and uveitis.

Responses to each clinical scenario were collected from 20 ophthalmologists (10 residents and 10 board-certified ophthalmologists) and ChatGPT-4o. Ophthalmologists subsequently revised their answers with assistance from ChatGPT-4o. All responses were anonymized and independently evaluated by 3 attending ophthalmologists based on 4 metrics: coherency, factuality, comprehensiveness, and safety (each on a 5-point scale).

The main outcome measures were the median total scores for each group in coherency, factuality, comprehensiveness, and safety (maximum of 15 points each, corresponding to 3 evaluators scoring on a 5-point scale).
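The scoring scheme can be sketched in a few lines of Python. This is only an illustration, assuming each total is the sum of the 3 evaluators' 5-point scores (which is consistent with the stated maximum of 15); the function names and the ratings below are hypothetical and are not taken from the study.

```python
from statistics import median

METRICS = ("coherency", "factuality", "comprehensiveness", "safety")


def total_score(evaluator_scores):
    """Sum one response's scores from the 3 evaluators (each 1-5) into a 3-15 total."""
    if len(evaluator_scores) != 3:
        raise ValueError("expected scores from exactly 3 evaluators")
    if any(not 1 <= s <= 5 for s in evaluator_scores):
        raise ValueError("each score must be on the 1-5 scale")
    return sum(evaluator_scores)


def group_medians(responses):
    """Median total score per metric across all responses in one group."""
    return {
        metric: median(total_score(r[metric]) for r in responses)
        for metric in METRICS
    }


# Made-up ratings for three responses in one group (illustrative only).
example_group = [
    {"coherency": [3, 4, 3], "factuality": [4, 4, 5],
     "comprehensiveness": [2, 3, 3], "safety": [4, 4, 4]},
    {"coherency": [4, 4, 4], "factuality": [5, 4, 4],
     "comprehensiveness": [3, 3, 4], "safety": [5, 4, 4]},
    {"coherency": [2, 3, 3], "factuality": [4, 5, 5],
     "comprehensiveness": [3, 2, 2], "safety": [3, 4, 4]},
]

medians = group_medians(example_group)
print(medians)
```

Using the median rather than the mean keeps a single unusually high or low response from dominating a group's summary, which suits ordinal rating data of this kind.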

Assistance from ChatGPT-4o significantly improved evaluation scores for coherency, comprehensiveness, and safety among both residents and board-certified ophthalmologists (all, P < 0.001). However, factuality scores showed no significant improvements (P = 0.114 and 0.839, respectively). Although ChatGPT-4o assistance increased citation frequency (residents: from 0.24 to 0.98 per response; board-certified ophthalmologists: from 0.12 to 0.68 per response; both P < 0.05), approximately 44% of these additional citations were identified as hallucinated, nonexistent, or incorrect. Notably, ChatGPT-4o assistance led to a significant increase in variability for factuality and safety scores in both groups (Brown–Forsythe test, all P < 0.05), whereas it decreased variability for coherency and comprehensiveness, with the reduction statistically significant among residents (P = 0.008 and P = 0.006, respectively).
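The Brown–Forsythe test compares the spread of groups by running a one-way ANOVA on each value's absolute deviation from its group median. The following is a minimal sketch of the test statistic only, written with the standard library; the function name and the before/after scores are hypothetical and do not reproduce the study's data or analysis code.

```python
from statistics import mean, median


def brown_forsythe_statistic(*groups):
    """F statistic from a one-way ANOVA on absolute deviations from each group median."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least 2 groups with at least 2 values each")
    n_total = sum(len(g) for g in groups)

    # z[i][j] = |x_ij - median of group i|
    z = []
    for g in groups:
        center = median(g)
        z.append([abs(x - center) for x in g])

    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total

    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    if within == 0:
        raise ValueError("no variation in the absolute deviations; statistic undefined")

    return (n_total - k) / (k - 1) * between / within


# Made-up total scores before and after assistance (illustrative only).
before = [10, 11, 10, 12, 11, 10]
after = [7, 14, 9, 15, 8, 13]

stat = brown_forsythe_statistic(before, after)
print(f"Brown-Forsythe F statistic: {stat:.2f}")
```

A larger statistic indicates a bigger difference in spread between the groups. Turning it into a P value requires the F distribution with (k - 1, N - k) degrees of freedom; in practice, `scipy.stats.levene` with `center='median'` performs this test and returns the P value directly.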

ChatGPT-4o effectively enhanced diagnostic reasoning and response quality, particularly among ophthalmology residents. However, successful integration into clinical education and practice requires careful management of increased variability in factuality and safety. This issue could be addressed by implementing strategies such as advanced retrieval-augmented generation systems to ensure the provision of accurate and safe clinical information.

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

## Full-text entities

- **Diseases:** uveitis (MESH:D014605), fatigue (MESH:D005221), strabismus (MESH:D013285), age-related macular degeneration (MESH:D008268), oculoplastic and orbital disease (MESH:D009916), glaucoma (MESH:D005901), inherited retinal disease (MESH:D012164), dry eye (MESH:D015352), diabetic retinopathy (MESH:D003930)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12919258/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12919258/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12919258/full.md

---
Source: https://tomesphere.com/paper/PMC12919258