# Benchmarking Large Language Models on the Taiwan Neurology Board Examinations (2018–2024): A Comparative Evaluation of GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1

**Authors:** Shih-Yi Lin, Ying-Yu Hsu, Pei-Chun Yeh, Chien-Sheng Hsu, Wu-Huei Hsu, Shih-Sheng Chang, Chia-Hung Kao

PMC · DOI: 10.3390/bioengineering13030302 · Bioengineering · 2026-03-05

## TL;DR

This paper benchmarks four large language models (GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1) on 1715 questions from the Taiwan Neurology Board Examination (2018–2024). GPT-o1 achieved the highest overall accuracy, at 83.86%.

## Contribution

The study uses 1715 real questions from the Taiwan Neurology Board Examination (2018–2024) as a benchmark for comparing four large language models across single-choice, multiple-choice, true–false, and image-based items.

## Key findings

- GPT-o1 achieved the highest overall accuracy at 83.86%.
- DeepSeek-V3 had the lowest overall accuracy at 65.62% and showed the greatest variability.
- Accuracy declined for all models in 2024, coinciding with shifts in question design.

## Abstract

Background and Purpose: Neurology requires integration of clinical reasoning, imaging interpretation, and current knowledge, making it an ideal field for evaluating large language models (LLMs). Methods: Using 1715 questions from the Taiwan Neurology Board Examination (2018–2024), we assessed four LLMs—GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1—across four formats: single-choice, multiple-choice, true–false, and image-based items. Results: GPT-o1 achieved the highest overall accuracy (83.86%) and demonstrated strong performance on cognitively demanding tasks (82.50% on true–false; 77.26% on image-based). DeepSeek-V3 scored lowest (65.62%) and showed the greatest variability. Statistical analyses confirmed significant inter-model differences (p < 0.01). Accuracy declined across all models in 2024, coinciding with shifts in question design. DeepSeek-R1 was further penalized by alignment-based refusals, resulting in up to 3.81% score loss. Conclusions: These results position the Taiwan Neurology Board Exam as a rigorous benchmark for LLM evaluation and underscore GPT-o1’s potential utility in neurology education and decision support.
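
To make the accuracy and refusal figures above concrete, the following is a minimal sketch of one way such scores could be tallied. It is not the authors' scoring code: the `Tally` class, both helper functions, and the rule that a refused item is scored as incorrect are assumptions made for illustration, and the counts are invented rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model result counts (not data from the paper)."""
    correct: int
    incorrect: int
    refused: int  # items the model declined to answer

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.refused


def accuracy_pct(t: Tally) -> float:
    """Accuracy over all items, assuming a refusal earns no credit."""
    return 100.0 * t.correct / t.total


def refusal_loss_upper_bound_pct(t: Tally) -> float:
    """Percentage points forfeited to refusals.

    This is an upper bound: it assumes every refused item would
    otherwise have been answered correctly.
    """
    return 100.0 * t.refused / t.total


# Invented counts, chosen only so the arithmetic is easy to follow.
example = Tally(correct=70, incorrect=25, refused=5)
print(accuracy_pct(example))                  # 70.0
print(refusal_loss_upper_bound_pct(example))  # 5.0
```

Under this assumed scoring rule, a model's refusals lower its accuracy even when its underlying knowledge is unchanged, which is one plausible reading of the score loss reported for DeepSeek-R1.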

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13024452/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC13024452/full.md

---
Source: https://tomesphere.com/paper/PMC13024452