Assessment of Correctness, Content Omission, and Risk of Harm in Large Language Model Responses to Ophthalmology Continuing Medical Education Questions

Jacqueline L. Chen; Amanda J. Lu; Rohan Verma; Li Wang; Douglas D. Koch; Allison J. Chen

PMC · DOI: 10.1016/j.xops.2026.101130 · February 26, 2026

Open Access

TL;DR

This study evaluates how accurately large language models answer ophthalmology questions and finds that while they are often correct, they can also provide harmful or incomplete information.

Contribution

The study applies a previously reported standardized rubric to evaluate LLM responses to ophthalmology continuing medical education questions, highlighting the risks of incorrect reasoning and potential harm.

Findings

01 ChatGPT-4 answered 82.5% of the multiple-choice questions correctly, compared with 49.2% for Gemini Pro 1.5.

02 Both models showed significant issues in their prose responses, including incorrect reasoning and potential harm.

03 The study recommends provider-guided auditing before LLMs are used in patient-facing settings.

Abstract

This study evaluated the multiple-choice accuracy and free-text prose responses of 2 large language models (LLMs) on ophthalmology continuing medical education questions. Question prompts and multiple-choice (MC) answer options were input into the 2 LLMs, and the responses were analyzed for accuracy and assessed for evidence of correctness, completeness, bias, and potential harm using a previously reported standardized rubric. Basic and Clinical Science Course questions and MC answer options from the American Academy of Ophthalmology question bank were used as inputs to the 2 LLMs (ChatGPT-4 and Google Vertex’s Gemini Pro 1.5). The MC responses were assessed for accuracy against the question bank’s designated correct answer. The free-text prose responses from the 2 LLMs were assessed by 3 board-certified ophthalmologists. Accuracy and assessment of correct and incorrect reasoning, inappropriate content,…

