# Assessment of Correctness, Content Omission, and Risk of Harm in Large Language Model Responses to Ophthalmology Continuing Medical Education Questions

**Authors:** Jacqueline L. Chen, Amanda J. Lu, Rohan Verma, Li Wang, Douglas D. Koch, Allison J. Chen

PMC · DOI: 10.1016/j.xops.2026.101130 · 2026-02-26

## TL;DR

This study evaluates how accurately large language models answer ophthalmology questions and finds that while they are often correct, they can also provide harmful or incomplete information.

## Contribution

The study applies a previously reported standardized rubric to LLM responses to ophthalmology continuing medical education questions, highlighting risks of incorrect reasoning and potential harm.

## Key findings

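- ChatGPT-4 answered 82.5% (99/120) of multiple-choice questions correctly, versus 49.2% (59/120) for Gemini Pro 1.5 (P < 0.05).
- Prose responses from both models usually showed correct reasoning (92% and 88% for ChatGPT-4 and Gemini Pro 1.5, respectively), but also incorrect reasoning (42% and 58%), inappropriate content (29% and 36%), missing content (42% and 30%), and possible physical or emotional harm (36% and 44%).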
- The study recommends provider-guided auditing before using LLMs in patient-facing settings.

## Abstract

The purpose of this study was to evaluate the accuracy and prose responses of 2 large language models (LLMs) to ophthalmology continuing medical education questions.

Question prompts and multiple choice (MC) answer options were input into the 2 LLMs, and responses were analyzed for accuracy and assessed for evidence of correctness, completeness, bias, and potential harm using a previously reported standardized rubric.

Basic and Clinical Science Course questions and MC answer options from the American Academy of Ophthalmology question bank were used as inputs into the 2 LLMs (ChatGPT-4 and Google Vertex’s Gemini Pro 1.5).

The MC responses were assessed for accuracy against the question bank’s designated correct answer. The free-text prose responses from the 2 LLMs were assessed by 3 board-certified ophthalmologists.

The main outcome measures were MC accuracy and evidence of correct reasoning, incorrect reasoning, inappropriate content, missing content, possibility of bias, and possibility of harm.

The MC accuracy rates of ChatGPT-4 and Gemini Pro 1.5 were 82.5% (99/120) and 49.2% (59/120) (P < 0.05), respectively. Though there was high evidence of correct reasoning in the prose responses (92% and 88% for ChatGPT-4 and Gemini Pro 1.5, respectively), there was also evidence of incorrect reasoning (42% and 58%), inappropriate content (29% and 36%), missing content (42% and 30%), and possibility of physical or emotional harm (36% and 44%).
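The reported accuracy figures can be checked directly from the counts. The sketch below recomputes the two percentages and runs a pooled two-proportion z-test. The abstract does not state which statistical test the authors used, so the choice of test is an assumption made only for illustration, and the function names are hypothetical. Because both models answered the same 120 questions, a paired test such as McNemar's may be more appropriate, but it would need per-question results that the abstract does not give.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumption: this is an illustrative test, not necessarily the one
    used in the study, and it treats the two samples as independent.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the abstract: correct answers out of 120 questions.
CHATGPT_CORRECT, GEMINI_CORRECT, N_QUESTIONS = 99, 59, 120

acc_chatgpt = accuracy(CHATGPT_CORRECT, N_QUESTIONS)
acc_gemini = accuracy(GEMINI_CORRECT, N_QUESTIONS)
z_stat, p_val = two_proportion_z_test(
    CHATGPT_CORRECT, N_QUESTIONS, GEMINI_CORRECT, N_QUESTIONS
)

print(f"ChatGPT-4: {acc_chatgpt:.1%}, Gemini Pro 1.5: {acc_gemini:.1%}")
print(f"z = {z_stat:.2f}, two-sided p = {p_val:.2g}")
```

Under these assumptions, the recomputed percentages match the reported 82.5% and 49.2%, and the p-value falls well below 0.05, consistent with the reported significance.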

Though ChatGPT-4 was able to perform well in MC accuracy, both LLMs contained inaccuracies, missing content, and material that could lead to harm in their prose responses. Our findings suggest that provider-guided auditing in ophthalmology is required before the use of the technology in direct patient-facing settings.

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13019321/full.md

---
Source: https://tomesphere.com/paper/PMC13019321