# Evaluation of Large Language Models in the Diagnosis, Urgency Triage, and Initial Management of Ophthalmic Emergencies

**Authors:** Surina Mittal, Yakshi Aggarwal

PMC · DOI: 10.7759/cureus.101433 · Cureus · 2026-01-13

## TL;DR

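This study compares three large language models (ChatGPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) on 40 text-based ophthalmic emergency vignettes. The models achieved similar overall scores (6.88 to 7.03 out of 8) with no statistically significant differences, but their diagnostic errors mean human oversight is still needed.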

## Contribution

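The study evaluates the performance of three leading LLMs in ophthalmic emergencies using 40 standardized clinical vignettes, each scored on an 8-point rubric covering diagnosis, urgency recognition, initial management, and red flag identification.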

## Key findings

- All three LLMs achieved diagnostic accuracies above 80% in ophthalmic emergencies.
- No significant differences were found between the models in diagnostic accuracy, urgency triage, or management advice.
- LLMs provided management advice aligned with recognized ophthalmology guidelines.

## Abstract

Introduction

Artificial intelligence (AI) technologies are progressing rapidly and becoming an integral part of how healthcare professionals obtain medical knowledge. Large language models (LLMs) now enable clinicians to have direct access to medical guidance and support in clinical reasoning. In ophthalmology, where prompt identification of sight-threatening symptoms is essential, these tools can offer diagnostic support, urgency triaging, and initial management guidance, thus potentially reducing delays in care and improving referrals. Limited evidence exists regarding their accuracy, reliability, and safety in eye emergencies. This study aims to compare the diagnostic accuracy, urgency triage, and initial management advice generated by the three leading LLMs, to evaluate their prospective role in the early assessment and management of acute eye presentations.

Methods

This cross-sectional study compared the performance of three LLMs: ChatGPT-5 (2025, OpenAI, San Francisco, CA, USA), Google Gemini 2.5 Pro (2025, Google DeepMind, London, UK), and Claude Opus 4.1 (2025, Anthropic, San Francisco, CA, USA). The models were tested on a set of 40 standardised ophthalmic emergency vignettes spanning five key subspecialties within ophthalmology. Each vignette was entered into each LLM, and responses were evaluated for diagnostic accuracy (2 points), urgency recognition (2 points), initial management advice (3 points), and identification of red flag symptoms (1 point), giving each vignette a total score between 0 and 8. Scores were compared across the three models, and statistical significance was assessed using the Wilcoxon signed-rank and Friedman tests.
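
The scoring rubric described above can be sketched in a few lines of Python. This is only an illustration of how the four components add up to the 8-point total. The names `RUBRIC_MAX` and `VignetteScore` are hypothetical, and the example scores are invented rather than taken from the study.

```python
from dataclasses import dataclass

# Maximum points per rubric component, as stated in the Methods.
RUBRIC_MAX = {"diagnosis": 2, "urgency": 2, "management": 3, "red_flags": 1}


@dataclass(frozen=True)
class VignetteScore:
    """Score for one LLM response to one vignette (hypothetical structure)."""

    diagnosis: int
    urgency: int
    management: int
    red_flags: int

    def __post_init__(self):
        # Reject any component outside its allowed range.
        for name, cap in RUBRIC_MAX.items():
            value = getattr(self, name)
            if not 0 <= value <= cap:
                raise ValueError(f"{name} must be between 0 and {cap}, got {value}")

    @property
    def total(self) -> int:
        # Total ranges from 0 to 8 (2 + 2 + 3 + 1).
        return self.diagnosis + self.urgency + self.management + self.red_flags


# Invented example: full marks except one point lost on management advice.
example = VignetteScore(diagnosis=2, urgency=2, management=2, red_flags=1)
print(example.total)  # 7
```

Validating each component against its cap keeps an out-of-range entry from silently inflating a total.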

Results

In this analysis, 40 clinical vignettes were each evaluated across three LLMs: ChatGPT-5, Gemini 2.5 Pro, and Claude Opus 4.1, producing 120 responses in total. Overall scores, out of a possible 8, were similar across ChatGPT (6.88 ± 1.16), Gemini (7.03 ± 1.21), and Claude (6.93 ± 1.19), with no significant differences identified on statistical analysis. Additional comparison of diagnostic scores, urgency triage, red flag recognition, and management scores yielded no significant differences between any of the LLMs. Subgroup analysis by subspecialty similarly yielded no significant differences between the LLMs.
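
To make the comparison of three models on the same vignettes concrete, here is a minimal standard-library sketch of the Friedman statistic, the test the authors name for this kind of repeated-measures comparison. It is not the authors' analysis code, whose software is not described here. The functions `average_ranks` and `friedman_statistic` are hypothetical helpers, the formula omits the tie correction that statistical packages normally apply, and the scores are invented for illustration.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic, without tie correction.

    rows: one list per vignette, holding one total score per model.
    """
    n = len(rows)      # number of vignettes (blocks)
    k = len(rows[0])   # number of models (treatments)
    rank_sums = [0.0] * k
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Invented totals (0 to 8) for four vignettes across three models.
scores = [[6, 7, 8], [5, 7, 8], [6, 8, 7], [7, 6, 8]]
print(friedman_statistic(scores))  # 4.5
```

With three models the statistic is compared against a chi-square distribution with 2 degrees of freedom, whose 5% critical value is about 5.99, so the invented data above would not reach significance. In practice a library routine such as `scipy.stats.friedmanchisquare` is preferable because rubric scores on a 0 to 8 scale produce many ties.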

Conclusion

This study demonstrates that ChatGPT-5, Google Gemini 2.5 Pro, and Claude Opus 4.1 perform comparably in diagnosing, triaging, and providing management advice for ophthalmic emergencies from text-based clinical vignettes. All models achieved diagnostic accuracies above 80% and provided management advice in line with recognised ophthalmology guidelines, with no statistically significant differences between their overall performances. These findings showcase the potential of LLMs as support tools for ophthalmic clinical advice, particularly for non-specialists, for whom such guidance is valuable. However, their diagnostic errors and the suboptimal management advice that followed from them emphasise the need for ongoing development and human supervision to ensure safety before widespread clinical application.

## Full-text entities

- **Diseases:** eye emergencies (MESH:D004630)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12896728/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12896728/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12896728/full.md

---
Source: https://tomesphere.com/paper/PMC12896728