Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Fares Antaki; David Mikhail; Daniel Milad; Danny A. Mammo; Sumit Sharma; Sunil K. Srivastava; Bing Yu Chen; Samir Touma; Mertcan Sevgi; Jonathan El-Khoury; Pearse A. Keane; Qingyu Chen; Yih Chung Tham; Renaud Duval

PMC · DOI: 10.1016/j.xops.2025.101034 · December 6, 2025


Open Access

TL;DR

This study evaluates how well GPT-5 models perform on ophthalmology questions, finding that the high-reasoning-effort configuration GPT-5-high achieves the highest accuracy (0.965).

Contribution

The study introduces an autograder framework for evaluating LLM answers in ophthalmology and benchmarks GPT-5 against prior models.

Findings

01

GPT-5-high achieved the highest accuracy (0.965) on ophthalmology questions, outperforming prior models.

02

GPT-5-mini-low was the most cost-effective high-performance configuration.

03

A new autograder framework was developed to assess LLM-generated answers against reference standards.
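One common way to reason about a cost-accuracy trade-off like the one in finding 02 is a Pareto frontier: keep only the configurations that no other configuration beats on both cost and accuracy. The sketch below is a generic illustration, not necessarily the authors' method. The `Result` type, the `pareto_frontier` helper, the configuration names, and all numbers are hypothetical placeholders, not results from the paper.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost: float      # hypothetical units, e.g. cost to run the full question set
    accuracy: float  # fraction correct, between 0 and 1


def pareto_frontier(results):
    """Keep configurations not beaten on both cost and accuracy by another one."""
    frontier = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the more accurate first.
    for r in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        # A costlier configuration only earns a place if it is strictly more accurate.
        if r.accuracy > best_acc:
            frontier.append(r)
            best_acc = r.accuracy
    return frontier


# Placeholder values for illustration only; NOT results from the paper.
demo = [
    Result("config-a", cost=1.0, accuracy=0.80),
    Result("config-b", cost=2.0, accuracy=0.78),  # costlier and less accurate than config-a
    Result("config-c", cost=3.0, accuracy=0.90),
]
frontier_names = [r.name for r in pareto_frontier(demo)]
```

Under this framing, a "most cost-effective high-performance configuration" would be a frontier point that sits near the top accuracy at a much lower cost.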

Abstract

Novel large language models (LLMs) such as Generative Pretrained Transformer-5 (GPT-5) integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Our objective was to evaluate the performance and cost-accuracy trade-offs of OpenAI’s GPT-5 compared with previous-generation LLMs on ophthalmic question answering. The study design was an evaluation of a diagnostic test or technology; the subject, GPT-5, is a publicly available LLM. In August 2025, 12 configurations of OpenAI’s GPT-5 series (3 model tiers across 4 reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical…
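The evaluation grid described in the abstract (3 model tiers across 4 reasoning effort settings, plus three comparison models) can be sketched as follows. This is a minimal illustration, not the authors' code. The tier names and effort labels are assumptions based on OpenAI's public naming, since this page only names GPT-5-high and GPT-5-mini-low explicitly. The `accuracy` helper is hypothetical and uses simple exact matching against an answer key.

```python
from itertools import product

# Assumed tier and effort names; the source only confirms "GPT-5-high" and "GPT-5-mini-low".
TIERS = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
EFFORTS = ["minimal", "low", "medium", "high"]
# Comparison models named in the abstract.
COMPARATORS = ["o1-high", "o3-high", "gpt-4o"]


def build_configurations():
    """Return the 12 GPT-5 configurations (3 tiers x 4 efforts) plus the comparators."""
    gpt5 = [f"{tier}-{effort}" for tier, effort in product(TIERS, EFFORTS)]
    return gpt5 + COMPARATORS


def accuracy(predictions, answer_key):
    """Fraction of multiple-choice answers that exactly match the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must be the same length")
    correct = sum(p == k for p, k in zip(predictions, answer_key))
    return correct / len(answer_key)


configs = build_configurations()
```

Each configuration would be scored on the same 260 questions, so the per-configuration accuracies are directly comparable.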




Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Topic Modeling