# Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination

**Authors:** Mingxin Liu, Tsuyoshi Okuhara, Zhehao Dai, Minghong Zhao, Wenqiang Yin, Hiroko Okada, Emi Furukawa, Takahiro Kiuchi

PMC · DOI: 10.1186/s12911-026-03370-y · BMC Medical Informatics and Decision Making · 2026-02-03

## TL;DR

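This study evaluates four reasoning-enhanced LLMs (GPT-5, Grok-4, Claude Opus 4.1, and Gemini 2.5 Pro) on the Japanese National Medical Examination. All four exceed 95% accuracy, but their errors concentrate in clinical questions, particularly image-based items and items with complex contextual information.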

## Contribution

The study provides the first comparative evaluation of the latest large language models on the Japanese National Medical Examination.

## Key findings

- Gemini 2.5 Pro achieved the highest overall accuracy at 97.2% on the Japanese National Medical Examination.
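- Three of the four LLMs (all except Gemini 2.5 Pro) were significantly less accurate on image-based items than on non-image-based items, and errors clustered in clinical questions with complex contextual information.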
- All four models exceeded a 95% accuracy benchmark, suggesting potential for use in medical education.

## Abstract

How accurately the latest reasoning-enhanced large language models (LLMs) perform on national medical licensing examinations remains unknown, yet this accuracy is crucial for determining how close they are to serving as effective knowledge sources for medical education. This study aimed to evaluate the performance of four reasoning-enhanced LLMs (GPT-5, Grok-4, Claude Opus 4.1, and Gemini 2.5 Pro) on the Japanese National Medical Examination (JNME), providing insights into their potential as educational resources and their future applicability in medical practice.

We evaluated LLM performance using the 2019 and 2025 JNME (n = 793). Questions were entered into each model with chain-of-thought prompting enabled. Accuracy was assessed overall and by question type. Incorrect responses were qualitatively reviewed by a licensed physician and a medical student.
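The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`predicted`, `key`, `has_image`) and the exact-match grading rule are assumptions made for the example, and the toy records are invented rather than taken from the study.

```python
from collections import defaultdict


def is_correct(predicted, answer_key):
    """Exact-match grading (assumed rule): the set of selected options
    must equal the answer key, so an extra option counts as wrong."""
    return set(predicted) == set(answer_key)


def accuracy_by_group(records, group_field):
    """Return {group value: fraction correct} for the given field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_field]
        totals[group] += 1
        if is_correct(rec["predicted"], rec["key"]):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy records, NOT data from the study.
records = [
    {"predicted": ["a"], "key": ["a"], "has_image": True},
    # Single-choice item answered with an unnecessary extra option.
    {"predicted": ["a", "c"], "key": ["a"], "has_image": True},
    {"predicted": ["b"], "key": ["b"], "has_image": False},
    # Multi-select item; option order does not matter.
    {"predicted": ["d", "e"], "key": ["e", "d"], "has_image": False},
]

by_image = accuracy_by_group(records, "has_image")
print(by_image)
```

Under an exact-match rule like this one, a response that adds an option to a single-choice question is graded as incorrect even when it contains the right answer, which is one of the error patterns the abstract reports.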

From highest to lowest, the overall accuracies of the four LLMs were 97.2% for Gemini 2.5 Pro, 96.3% for GPT-5, 96.1% for Claude Opus 4.1, and 95.6% for Grok-4, with no significant pairwise differences. Gemini 2.5 Pro achieved the highest accuracy on both image-based (96.1%) and non-image-based (97.6%) items, with no significant difference between the two, whereas accuracy was significantly lower on image-based items for the other three LLMs. Across difficulty levels, Gemini 2.5 Pro again achieved the highest accuracy (98.4% for easy, 97.3% for moderate, and 93.2% for difficult items). Within each LLM, accuracy on difficult questions was significantly lower than on easy questions. Common error patterns included providing unnecessary additional options in single-choice questions, misdiagnosis of X-ray or computed tomography images (primarily due to confusion regarding left–right laterality), and difficulties in prioritizing appropriate actions in clinical questions with complex contextual information.
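To give a sense of scale for the overall accuracies above, the sketch below converts each reported percentage into an approximate count of correct answers and a 95% Wilson score interval. It assumes every model was graded on all 793 items, which the abstract does not state explicitly. The Wilson interval is a swapped-in illustration, not the statistical method used in the paper, and it ignores the fact that all models answered the same questions.

```python
import math

N_ITEMS = 793  # total questions reported in the abstract

# Overall accuracies as reported in the abstract.
REPORTED_ACCURACY = {
    "Gemini 2.5 Pro": 0.972,
    "GPT-5": 0.963,
    "Claude Opus 4.1": 0.961,
    "Grok-4": 0.956,
}


def approx_correct(accuracy, n=N_ITEMS):
    """Approximate number of correct answers (assumes all n items graded)."""
    return round(accuracy * n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


for model, acc in REPORTED_ACCURACY.items():
    k = approx_correct(acc)
    low, high = wilson_interval(k, N_ITEMS)
    print(f"{model}: ~{k}/{N_ITEMS} correct, 95% CI {low:.3f} to {high:.3f}")
```

Because the four models answered the same items, a paired comparison on per-item results (for example McNemar's test) would be the more appropriate check of pairwise differences. That requires item-level data, which this summary does not include.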

Four LLMs released in 2025 surpassed the 95% benchmark on the JNME, and their near-perfect (approximately 99%) performance on basic medical knowledge questions highlights substantial potential for use as learning resources in foundational medical education. Gemini 2.5 Pro demonstrated the most consistent performance across question types, while Grok-4 showed greater variability. The concentration of errors in clinical questions indicates that LLMs still require substantial refinement and validation before their use can be extended to clinical reasoning or patient care.

The online version contains supplementary material available at 10.1186/s12911-026-03370-y.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12958580/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12958580/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/PMC12958580/full.md

---
Source: https://tomesphere.com/paper/PMC12958580