# Comparative Assessment of Large Language Models in Optics and Refractive Surgery: Performance on Multiple-Choice Questions

**Authors:** Leah Attal, Elad Shvartz, Alon Gorenshtein, Shirley Pincovich, Daniel Bahir

PMC · DOI: 10.3390/vision9040085 · Vision · 2025-10-09

## TL;DR

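This study tested seven LLMs on 134 multiple-choice questions in optics and refractive surgery drawn from national ophthalmology certification exams. ChatGPT O1 had the highest overall accuracy (83.5%), and several models performed well on calculation and image-based questions, suggesting potential use in exam preparation and resident training.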

## Contribution

The study provides a head-to-head comparison of seven LLMs on optics and refractive surgery MCQs, broken down by calculation requirement, subspecialty, and image use, and highlights their potential for medical training.

## Key findings

- ChatGPT O1 achieved the highest overall accuracy (83.5%) in answering ophthalmology MCQs.
- DeepSeek V3 excelled in refractive surgery questions (89.7%) and ChatGPT O3 Mini in image analysis (88.2%).
- LLMs showed high accuracy in complex optical calculations and visual items, suggesting potential for medical education.

## Abstract

This study aimed to evaluate the performance of seven advanced large language models (LLMs), namely ChatGPT 4o, ChatGPT O3 Mini, ChatGPT O1, DeepSeek V3, DeepSeek R1, Gemini 2.0 Flash, and Grok-3, in answering multiple-choice questions (MCQs) in optics and refractive surgery, to assess their role in medical education for residents. The models were tested on 134 publicly available MCQs from national ophthalmology certification exams, categorized by the need to perform calculations, the relevant subspecialty, and the use of images. Accuracy was analyzed and compared statistically. ChatGPT O1 achieved the highest overall accuracy (83.5%), excelling in complex optical calculations (84.1%) and optics questions (82.4%). DeepSeek V3 displayed superior accuracy in refractive surgery-related questions (89.7%), followed by ChatGPT O3 Mini (88.4%). ChatGPT O3 Mini significantly outperformed the other models in image analysis, with 88.2% accuracy. Moreover, ChatGPT O1 demonstrated comparable accuracy on calculated and non-calculated questions (84.1% vs. 83.3%), in stark contrast to the other models, which showed significant discrepancies between the two question types. The findings highlight the ability of LLMs to achieve high accuracy in ophthalmology MCQs, particularly in complex optical calculations and visual-stem (image-based) questions. These results suggest potential applications in exam preparation and medical training, while underscoring the need for future studies designed to directly evaluate their role and impact in medical education. Future studies should use larger, multilingual datasets to confirm and extend these preliminary findings.
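The abstract describes grouping questions by calculation requirement, subspecialty, and image use, then comparing accuracy across models. A minimal Python sketch of that kind of tabulation is shown here. The record layout, the model names `model_a` and `model_b`, and the helper `accuracy_by` are hypothetical, and the toy data is illustrative only, not taken from the paper. The statistical tests used in the study are not reproduced.

```python
from collections import defaultdict

# Hypothetical per-question results: one row per (model, question).
# Fields: model, calculation required, subspecialty, has image, answered correctly.
# These values are made up for illustration and do not come from the paper.
FIELDS = ("model", "calculation", "subspecialty", "image", "correct")
ROWS = [
    ("model_a", True, "optics", False, True),
    ("model_a", True, "optics", True, False),
    ("model_a", False, "refractive", False, True),
    ("model_a", False, "refractive", True, True),
    ("model_b", True, "optics", False, False),
    ("model_b", True, "optics", True, False),
    ("model_b", False, "refractive", False, True),
    ("model_b", False, "refractive", True, False),
]
records = [dict(zip(FIELDS, row)) for row in ROWS]


def accuracy_by(records, key):
    """Return {(model, value of key): fraction of questions answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        group = (r["model"], r[key])
        totals[group] += 1
        hits[group] += int(r["correct"])
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    for key in ("calculation", "subspecialty", "image"):
        print(key)
        for (model, value), acc in sorted(accuracy_by(records, key).items(), key=str):
            print(f"  {model:8s} {str(value):12s} {acc:.1%}")
```

Grouping on `(model, category value)` keeps each breakdown independent, so the same helper covers all three categorizations mentioned in the abstract.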

## Full-text entities

- **Diseases:** retinal detachment (MESH:D012163), psychiatric (MESH:D001523), genetic disorders (MESH:D030342), ocular diseases (MESH:D005128), hallucinations (MESH:D006212), injury to (MESH:D014947), glaucoma (MESH:D005901), anxiety (MESH:D001007), retinal diseases (MESH:D012164)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12550897/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12550897/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12550897/full.md

---
Source: https://tomesphere.com/paper/PMC12550897