Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Fares Antaki; David Mikhail; Daniel Milad; Danny A Mammo; Sumit Sharma; Sunil K Srivastava; Bing Yu Chen; Samir Touma; Mertcan Sevgi; Jonathan El-Khoury; Pearse A Keane; Qingyu Chen; Yih Chung Tham; Renaud Duval

arXiv:2508.09956 · cs.CL · August 15, 2025

TL;DR

This study benchmarks 12 GPT-5 configurations on 260 ophthalmology multiple-choice questions, reports accuracy and accuracy-cost trade-offs, and introduces an autograder framework for evaluating LLM performance in medical QA tasks.

Contribution

It provides the first comprehensive performance assessment of GPT-5 in ophthalmology question answering and introduces a scalable autograder framework for evaluating LLM-generated answers.

Findings

01

GPT-5-high achieved 96.5% accuracy.

02

GPT-5-mini-low offers the best cost-performance balance.

03

GPT-5 models outperform GPT-4o and some of the other evaluated configurations on accuracy and rationale quality.

Abstract

Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high…
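
The head-to-head ranking mentioned in the abstract relies on a Bradley-Terry model, which assigns each configuration a strength p_i such that the probability of i beating j is p_i / (p_i + p_j). The following is a minimal sketch of fitting such a model with the standard minorization-maximization update, not the authors' implementation. The function name `fit_bradley_terry`, the three-model setup, and all win counts are hypothetical and chosen only for illustration; the excerpt does not describe how the paper fitted its model.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of comparisons in which item i beat item j.
    Returns a list of strengths normalized to sum to 1.
    Assumes every item has at least one win (otherwise its strength
    collapses to 0 and the estimate is not well defined).
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (games between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical win counts for three made-up models (not data from the paper).
# Each pair was compared 10 times.
example_wins = [
    [0, 7, 9],  # model A beat B 7 times and C 9 times
    [3, 0, 6],  # model B beat A 3 times and C 6 times
    [1, 4, 0],  # model C beat A once and B 4 times
]
strengths = fit_bradley_terry(example_wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Here `ranking` orders the hypothetical models from strongest to weakest. With only two items the fitted strengths reduce to the observed win fractions, which is a quick sanity check on the update rule.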
