# Evaluation of multiple generative large language models on neurology board-style questions

**Authors:** Mohammad Almomani, Vijaya Valaparla, James Weatherhead, Xiang Fang, Alok Dabi, Chih-Ying Li, Peter McCaffrey, Dan Hier, Jorge Mario Rodríguez-Fernández

PMC · DOI: 10.3389/fdgth.2025.1737882 · Frontiers in Digital Health · 2026-01-05

## TL;DR

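This study compared eight large language models with neurology residents on 107 board-style questions. ChatGPT-5 (84.1%), ChatGPT-4o (81.3%), Gemini 2.5 (77.6%), and ChatGPT-4 (68.2%) scored above the residents (64.9%), while the other four models scored below.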

## Contribution

The study introduces a benchmarking framework for evaluating LLMs on neurology board-style questions across subspecialties and cognitive levels.

## Key findings

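- ChatGPT-5 (84.1%) and ChatGPT-4o (81.3%) had the highest overall accuracy, ahead of residents (64.9%); on higher-order questions ChatGPT-5 led at 86%, while Gemini 2.5 matched ChatGPT-4o at 82.5%.
- Gemini 2.5 (77.6%) improved substantially over Gemini v1 (39.3%), but its topic-level accuracies did not track those of v1 (R² = 0.002, p = 0.918), indicating uneven gains across domains.
- Confidence–accuracy calibration was weak across all models, so the authors recommend using LLMs only as supervised educational adjuncts.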

## Abstract

This study compared the performance of eight large language models (LLMs) with that of neurology residents on board-style multiple-choice questions across seven subspecialties and two cognitive levels.

In a cross-sectional benchmarking study, we evaluated Bard, Claude, Gemini v1, Gemini 2.5, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, and ChatGPT-5 using 107 text-only items spanning movement disorders, vascular neurology, neuroanatomy, neuroimmunology, epilepsy, neuromuscular disease, and neuro-infectious disease. Items were labeled as lower- or higher-order per Bloom's taxonomy by two neurologists. Models answered each item in a fresh session and reported confidence and Bloom classification. Residents completed the same set under exam-like conditions. Outcomes included overall and domain accuracies, guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), agreement with expert Bloom labels (Cohen κ), and inter-generation scaling (linear regression of topic-level accuracies). Group differences were assessed with Fisher exact or χ² tests with Bonferroni correction.
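Two of the outcome measures named above, Cohen κ for agreement with expert Bloom labels and Spearman ρ for confidence–accuracy calibration, can be illustrated with a minimal standard-library sketch. The function names and the six-item toy data are hypothetical and are not taken from the study; the code only shows how such statistics are commonly computed, not how the authors implemented them.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho computed as the Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the study): six items.
expert_bloom = ["lower", "higher", "higher", "lower", "higher", "lower"]
model_bloom = ["lower", "higher", "lower", "lower", "higher", "higher"]
confidence = [90, 80, 95, 60, 85, 70]  # model's self-reported confidence
correct = [1, 1, 0, 0, 1, 1]           # 1 = answered correctly

kappa = cohen_kappa(expert_bloom, model_bloom)
rho = spearman_rho(confidence, correct)
print(f"kappa = {kappa:.3f}, rho = {rho:.3f}")
```

A ρ near zero in a setup like this means that higher stated confidence does not go with a higher chance of being correct, which is the sense in which calibration is described as weak.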

Residents scored 64.9%. ChatGPT-5 achieved 84.1% and ChatGPT-4o 81.3%, followed by Gemini 2.5 at 77.6% and ChatGPT-4 at 68.2%; Claude (56.1%), Bard (54.2%), ChatGPT-3.5 (53.3%), and Gemini v1 (39.3%) underperformed residents. On higher-order items, ChatGPT-5 (86%) and ChatGPT-4o (82.5%) maintained superiority; Gemini 2.5 matched 82.5%. Guessing-adjusted accuracy preserved rank order (ChatGPT-5 78.8%, ChatGPT-4o 75.1%, Gemini 2.5 70.1%). Confidence–accuracy calibration was weak across models. Inter-generation scaling was strong within the ChatGPT lineage (ChatGPT-4 to 4o R² = 0.765, p = 0.010; 4o to 5 R² = 0.908, p < 0.001) but absent for Gemini v1 to 2.5 (R² = 0.002, p = 0.918), suggesting discontinuous improvements.
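The abstract does not state how guessing-adjusted accuracy was computed. As a rough sketch, the common chance correction (p − 1/k) / (1 − 1/k), under the assumption of k = 4 answer options per item, gives values that agree with the three reported figures. Both the formula choice and the four-option assumption are inferences rather than details confirmed by the source, and `guessing_adjusted` is a hypothetical helper name.

```python
def guessing_adjusted(accuracy, n_options=4):
    """Rescale raw accuracy so chance-level maps to 0 and a perfect score to 1.

    n_options=4 is an assumption; the abstract does not give the option count.
    """
    chance = 1 / n_options
    return (accuracy - chance) / (1 - chance)


# Raw overall accuracies (percent) as reported in the abstract.
reported = {"ChatGPT-5": 84.1, "ChatGPT-4o": 81.3, "Gemini 2.5": 77.6}

adjusted = {
    model: round(100 * guessing_adjusted(pct / 100), 1)
    for model, pct in reported.items()
}

for model, value in adjusted.items():
    print(f"{model}: raw {reported[model]}% -> adjusted {value}%")
```

Because the correction is a linear rescaling, it cannot change the ordering of models, which is consistent with the statement that rank order was preserved.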

LLMs—particularly ChatGPT-5 and ChatGPT-4o—exceeded resident performance on text-based neurology board-style questions across subspecialties and cognitive levels. Gemini 2.5 showed substantial gains over v1 but with domain-uneven scaling. Given weak confidence calibration, LLMs should be integrated as supervised educational adjuncts with ongoing validation, version governance, and transparent metadata to support safe use in neurology education.

## Full-text entities

- **Diseases:** vascular neurology (MESH:D020785), epilepsy (MESH:D004827), neuromuscular disease (MESH:D009468), neuro-infectious disease (MESH:D003141), movement disorders (MESH:D009069)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12813092/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12813092/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12813092/full.md

---
Source: https://tomesphere.com/paper/PMC12813092