Performance of GPT-5 Frontier Models in Ophthalmology Question Answering
Fares Antaki, David Mikhail, Daniel Milad, Danny A Mammo, Sumit Sharma, Sunil K Srivastava, Bing Yu Chen, Samir Touma, Mertcan Sevgi, Jonathan El-Khoury, Pearse A Keane, Qingyu Chen, Yih Chung Tham, Renaud Duval

TL;DR
This study benchmarks 12 GPT-5 configurations on 260 ophthalmology multiple-choice questions, reports high accuracy, characterizes accuracy-cost trade-offs, and introduces an autograder framework for evaluating LLM-generated answers in medical question answering.
Contribution
It provides the first comprehensive performance assessment of GPT-5 in ophthalmology question answering and introduces a scalable autograder framework for evaluating LLM-generated answers.
Findings
GPT-5-high achieved 96.5% accuracy.
GPT-5-mini-low offers the best cost-performance balance.
GPT-5 models outperform GPT-4o and some other evaluated configurations on accuracy and rationale quality.
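
A cost-performance comparison of this kind can be sketched as a token-based cost estimate followed by a Pareto filter over (cost, accuracy) pairs. The sketch below is illustrative only: the `RunSummary` fields, the per-million-token prices, and the function names are assumptions, not the authors' pipeline or actual pricing, and any numbers passed in are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate usage for one model configuration (hypothetical structure)."""
    name: str           # configuration label
    accuracy: float     # fraction of questions answered correctly
    input_tokens: int   # total prompt tokens over the benchmark
    output_tokens: int  # total completion tokens (assumed to include reasoning tokens)


def estimate_cost(run, usd_per_m_input, usd_per_m_output):
    """Token-based cost in USD; prices are per one million tokens (assumed inputs)."""
    total = run.input_tokens * usd_per_m_input + run.output_tokens * usd_per_m_output
    return total / 1_000_000


def pareto_frontier(points):
    """Return names of configurations that no other configuration dominates.

    points: list of (name, cost, accuracy) tuples.
    A point is dominated if another point is no more expensive and no less
    accurate, and strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

Configurations on the frontier are the candidates for a "best balance" claim. Picking a single one among them still requires a stated preference between cost and accuracy.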
Abstract
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high…
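
For readers unfamiliar with Bradley-Terry ranking, the following is a minimal sketch of how strengths can be fitted from pairwise win counts using the standard minorization-maximization (MM) iteration. The excerpt does not describe how the authors derived pairwise outcomes or which fitting routine they used, so the `wins` matrix layout and the function name `fit_bradley_terry` are assumptions made for illustration.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM algorithm.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths normalized to sum to 1. Under the model,
    P(i beats j) = p[i] / (p[i] + p[j]).
    Assumes every item has at least one win and one loss; otherwise
    the maximum-likelihood estimate is degenerate.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # Keep the old value if item i has no recorded comparisons.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p
```

With two items where the first wins 3 of 4 comparisons, the fitted strengths are 0.75 and 0.25, matching the observed win rate. With more items, sorting by fitted strength gives the head-to-head ranking.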
