# Comparative performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 in ophthalmology question answering

**Authors:** Ping Zhang, Jiaoman Wang, Xinya Hu, Xiaoqing Wang, Xianming Fan, Wei Chi, Weihua Yang

PMC · DOI: 10.3389/fcell.2026.1744389 · Frontiers in Cell and Developmental Biology · 2026-01-29

## TL;DR

This study compares five large language models on 300 multiple-choice ophthalmology questions. Gemini-3-Flash was the most accurate (83.3%) and GPT-o3 the most consistent across repeated runs (κ = 0.966).

## Contribution

The paper provides a systematic comparison of multiple state-of-the-art LLMs in ophthalmology QA tasks, focusing on accuracy and consistency.

## Key findings

- Gemini-3-Flash achieved the highest overall accuracy (83.3%) in ophthalmology question answering.
- GPT-o3 showed the highest decision stability (κ = 0.966) among the tested models.
- Prompt engineering had limited impact on performance for closed-ended medical questions.

## Abstract

The application of large language models (LLMs) in medicine is rapidly advancing, showing particular promise in specialized fields like ophthalmology. However, existing research has predominantly focused on validating individual models, with a notable scarcity of systematic comparisons between multiple state-of-the-art LLMs.

This study systematically evaluates the performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 on ophthalmology question-answering tasks, with a specific focus on response consistency and factual accuracy.

A total of 300 single-best-answer multiple-choice questions were sampled from the StatPearls ophthalmology question bank. The questions were categorized into four difficulty levels (Levels 1–4) based on the inherent difficulty ratings provided by the database. Each model provided independent answers three times under two distinct prompting strategies: a direct neutral prompt and a role-based prompt. Fleiss’ kappa (κ) was used to assess inter-run response consistency, and overall accuracy was employed as the primary performance metric.
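The consistency metric described above can be illustrated with a minimal sketch of Fleiss' kappa, treating each repeated run of a model as one "rater" and each question as one item. The function name `fleiss_kappa` and the toy answers below are illustrative assumptions, not the authors' code or data, and the summary does not state whether agreement was computed over answer letters or over correct/incorrect outcomes.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical ratings.

    Hypothetical helper: ratings[i] holds the answers one model gave to
    question i across its repeated runs (e.g. ["A", "A", "B"]).
    Every item must have the same number of ratings (at least 2).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must contain at least one item")
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(r) != n_runs for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()  # how often each answer option was chosen overall
    agreement_sum = 0.0          # sum of per-item agreement P_i

    for answers in ratings:
        counts = Counter(answers)
        category_totals.update(counts)
        # P_i: share of run pairs that agree on this item
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_runs
        ) / (n_runs * (n_runs - 1))

    p_observed = agreement_sum / n_items
    total_ratings = n_items * n_runs
    # Chance agreement from the marginal distribution of answer options
    p_expected = sum((t / total_ratings) ** 2 for t in category_totals.values())

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: four questions, three runs each (not from the paper)
toy_runs = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["A", "B", "A"],
    ["C", "C", "C"],
]
print(round(fleiss_kappa(toy_runs), 4))  # 0.7447
```

In this toy case three of four questions are answered identically in every run, giving observed agreement 5/6 against chance agreement 50/144, so κ = 70/94 ≈ 0.745. Values near 1, such as the κ = 0.966 reported for GPT-o3, indicate that a model almost always repeats the same answer.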

- **Accuracy:** Gemini-3-Flash achieved the highest overall accuracy (83.3%), followed by GPT-o3 (79.2%) and DeepSeek-R1 (74.4%). GPT-4 (69.9%) and GPT-5 (69.1%) demonstrated the lowest accuracies.
- **Consistency:** GPT-o3 demonstrated the highest decision stability (κ = 0.966), followed by DeepSeek-R1 (κ = 0.904) and Gemini-3-Flash (κ = 0.860). GPT-5 exhibited the lowest stability (κ = 0.668).
- **Influencing factors:** Prompting strategies did not significantly affect model accuracy. While Gemini-3-Flash remained stable across difficulty levels, DeepSeek-R1 and GPT-o3 showed enhanced relative performance on more complex tasks.

GPT-o3 and Gemini-3-Flash achieve superior stability and accuracy in ophthalmology question answering (QA), making them suitable for high-stakes clinical decision support. The open-source model DeepSeek-R1 shows competitive potential, especially in complex tasks. Notably, GPT-5 failed to surpass its predecessor in both accuracy and consistency in this specialized domain. Prompt engineering has a limited impact on performance for closed-ended medical questions. Future work should extend to multimodal integration and real-world clinical validation to enhance the practical utility and reliability of LLMs in medicine.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12894337/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12894337/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC12894337/full.md

---
Source: https://tomesphere.com/paper/PMC12894337