# Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

**Authors:** Fares Antaki, David Mikhail, Daniel Milad, Danny A. Mammo, Sumit Sharma, Sunil K. Srivastava, Bing Yu Chen, Samir Touma, Mertcan Sevgi, Jonathan El-Khoury, Pearse A. Keane, Qingyu Chen, Yih Chung Tham, Renaud Duval

PMC · DOI: 10.1016/j.xops.2025.101034 · 2025-12-06

## TL;DR

This study benchmarks 12 GPT-5 configurations on 260 ophthalmology multiple-choice questions and finds that GPT-5 with high reasoning effort reaches near-perfect accuracy (0.965).

## Contribution

The study introduces an autograder framework for evaluating LLM answers in ophthalmology and benchmarks GPT-5 against prior models.

## Key findings

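- GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942–0.985), significantly outperforming all GPT-5-nano variants, o1-high, and GPT-4o, but not o3-high (0.958).
- Several GPT-5 configurations lay on the cost-accuracy Pareto frontier; GPT-5-mini-low offered the best low-cost, high-performance trade-off.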
- A new autograder framework was developed to assess LLM-generated answers against reference standards.

## Abstract

Novel large language models (LLMs) such as Generative Pretrained Transformer-5 (GPT-5) integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Our objective was to evaluate the performance and cost-accuracy trade-offs of OpenAI’s GPT-5 compared with previous generation LLMs on ophthalmic question answering.

The study design was an evaluation of a diagnostic test or technology.

Generative Pretrained Transformer-5 is a publicly available LLM.

In August 2025, 12 configurations of OpenAI’s GPT-5 series (3 model tiers across 4 reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course data set. The study did not include human participants.

The primary outcome was accuracy on the 260-item ophthalmology multiple-choice question set for each model configuration. The secondary outcomes included head-to-head ranking of configurations using a Bradley–Terry model applied to paired win/loss comparisons of answer accuracy, and evaluation of generated natural language rationales using a reference-anchored, pairwise LLM-as-a-judge framework. Additional analyses assessed the accuracy-cost trade-off by calculating mean per-question cost from token usage and identifying Pareto-efficient configurations.
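The head-to-head ranking described above can be illustrated with a minimal sketch. The abstract does not say how the Bradley–Terry model was fitted or how ties were handled, so the following are assumptions made only for illustration: a model "wins" a question when it is correct and its opponent is not, questions where both are right or both are wrong are ignored, and strengths are estimated with the standard minorization-maximization (MM) update. The function names and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

Wins = Dict[Tuple[str, str], int]


def wins_from_correctness(results: Dict[str, List[bool]]) -> Wins:
    """Count paired wins: a beats b on a question if a is right and b is wrong.

    Assumption: questions where both models agree in correctness are ignored.
    """
    wins: Wins = {}
    models = sorted(results)
    for a in models:
        for b in models:
            if a == b:
                continue
            wins[(a, b)] = sum(
                1 for ra, rb in zip(results[a], results[b]) if ra and not rb
            )
    return wins


def fit_bradley_terry(wins: Wins, iters: int = 500) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths with the MM update.

    wins[(a, b)] is the number of comparisons that a won against b.
    Strengths are rescaled each iteration so that they average to 1.
    """
    items = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Keep the old value if this item was never compared.
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {m: v * len(items) / norm for m, v in updated.items()}
    return strength


# Toy per-question correctness for two hypothetical models (not study data).
toy_results = {
    "model-a": [True, True, True, False],
    "model-b": [True, False, False, True],
}
toy_strengths = fit_bradley_terry(wins_from_correctness(toy_results))
print(toy_strengths["model-a"] / toy_strengths["model-b"])
```

Under this reading, a statement such as "1.66x stronger" would correspond to the ratio of two fitted strengths, although the abstract does not define the term explicitly.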

The configuration GPT-5-high achieved the highest accuracy (0.965; 95% confidence interval [CI], 0.942–0.985), significantly outperforming all GPT-5-nano variants (P < 0.001), o1-high (P = 0.04), and GPT-4o (P < 0.001), but not o3-high (0.958; 95% CI, 0.931–0.981). The configuration GPT-5-high ranked first in accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high), as judged by a reference-anchored LLM-as-a-judge autograder. Cost-accuracy analysis identified multiple GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low providing the best low-cost, high-performance configuration.
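To make the cost-accuracy analysis concrete, the sketch below shows one way to compute a per-question cost from token counts and to pick out Pareto-efficient configurations, meaning those for which no other configuration is at least as accurate and at least as cheap while being strictly better on one of the two. The configuration names, accuracies, token counts, and prices are invented placeholders, not values from the study, and the pricing formula (separate per-million-token rates for input and output) is an assumption about how such a cost might be derived.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Config:
    name: str
    accuracy: float   # fraction of questions answered correctly
    cost_usd: float   # mean cost per question in US dollars


def per_question_cost(
    input_tokens: float,
    output_tokens: float,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one question given token counts and per-million-token prices.

    Assumption: any reasoning tokens are already counted in output_tokens.
    """
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates."""
    frontier = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost_usd <= c.cost_usd
            and (o.accuracy > c.accuracy or o.cost_usd < c.cost_usd)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost_usd)


# Hypothetical configurations with made-up numbers (not study results).
toy_configs = [
    Config("config-a", accuracy=0.90, cost_usd=0.001),
    Config("config-b", accuracy=0.95, cost_usd=0.004),
    Config("config-c", accuracy=0.93, cost_usd=0.006),  # dominated by config-b
    Config("config-d", accuracy=0.97, cost_usd=0.020),
]
toy_frontier = pareto_frontier(toy_configs)
print([c.name for c in toy_frontier])
```

Choosing a single "best" point on the frontier still requires a judgment about how much extra accuracy is worth a given increase in cost; the frontier only removes options that are worse on both axes.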

This study benchmarks the GPT-5 series on a high-quality ophthalmology question-answering data set, demonstrating that GPT-5 with high reasoning effort achieved near-perfect accuracy and outperformed prior reasoning LLMs. This study also introduces an autograder framework for scalable, automated evaluation of LLM-generated answers against reference standards in ophthalmology.

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12811449/full.md

---
Source: https://tomesphere.com/paper/PMC12811449