Assessment of Correctness, Content Omission, and Risk of Harm in Large Language Model Responses to Ophthalmology Continuing Medical Education Questions
Jacqueline L. Chen, Amanda J. Lu, Rohan Verma, Li Wang, Douglas D. Koch, Allison J. Chen

TL;DR
This study evaluates how accurately large language models answer ophthalmology questions and finds that while they are often correct, they can also provide harmful or incomplete information.
Contribution
The study applies a previously reported standardized rubric to evaluate LLM responses to ophthalmology continuing medical education questions, highlighting risks of incorrect reasoning and potential harm.
Findings
ChatGPT-4 had an 82.5% accuracy rate in multiple-choice answers, while Gemini Pro 1.5 had 49.2%.
Both models showed significant issues in prose responses, including incorrect reasoning and potential harm.
The study recommends provider-guided auditing before using LLMs in patient-facing settings.
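The multiple-choice accuracy figures above come from comparing each model's selected option with the question bank's designated correct answer. A minimal sketch of that kind of scoring follows. The function name, the dictionary layout, and the toy data are hypothetical illustrations, not the study's materials or code, and counting an unanswered question as incorrect is an assumption.

```python
def mc_accuracy(responses, answer_key):
    """Return the fraction of questions whose chosen option matches the key.

    responses:  dict mapping question id -> option letter chosen by the model
    answer_key: dict mapping question id -> designated correct option letter
    Questions missing from `responses` are counted as incorrect (assumption).
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        1
        for qid, key in answer_key.items()
        # Compare case-insensitively and ignore surrounding whitespace.
        if responses.get(qid, "").strip().upper() == key.strip().upper()
    )
    return correct / len(answer_key)


# Hypothetical toy data, NOT taken from the study.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
model_responses = {"q1": "A", "q2": "c", "q3": "D"}  # q4 left unanswered

print(f"{mc_accuracy(model_responses, answer_key):.1%}")
```

A score of this kind covers only the selected option. The correctness, completeness, bias, and potential harm of the prose responses were judged separately by ophthalmologist reviewers using the rubric.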
Abstract
To evaluate the accuracy and prose responses of 2 large language models (LLMs) to ophthalmology continuing medical education questions. Question prompts and multiple-choice (MC) answer options were input into the 2 LLMs, and responses were analyzed for accuracy and assessed for evidence of correctness, completeness, bias, and potential harm using a previously reported standardized rubric. Basic and Clinical Science Course questions and MC answer options from the American Academy of Ophthalmology question bank were used as inputs into the 2 LLMs (ChatGPT-4 and Google Vertex’s Gemini Pro 1.5). The MC responses were assessed for accuracy in comparison to the question bank’s designated correct answer. The free-text prose responses from the 2 LLMs were assessed by 3 board-certified ophthalmologists. Accuracy and assessment of correct and incorrect reasoning, inappropriate content,…
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Categorization, perception, and language · Genomics and Rare Diseases
