# Evaluating the capability of large language models in radiotherapy through professional certification examinations in Japan

**Authors:** Noriyuki Kadoya, Yoshiyuki Takahashi, Seiya Koga, Hikaru Tanno, Kazuhiro Arai, Shohei Tanaka, Yoshiyuki Katsuta, Hinako Harada, So Omata, Takaya Yamamoto, Rei Umezawa, Ken Takeda, Keiichi Jingu

PMC · DOI: 10.1093/jrr/rraf083 · 2026-01-10

## TL;DR

This study tested five large language models on three Japanese radiotherapy certification exams. All five averaged above 75% accuracy, and ChatGPT-5 Pro scored highest at 94.7%.

## Contribution

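The study benchmarks five LLMs (ChatGPT-5, ChatGPT-5 Pro, Grok 4, Grok 4 heavy and Gemini 2.5 Pro) on the multiple-choice questions of three Japanese certification exams for medical physicists, radiologists and radiation oncologists. Each answer was scored against reference answers determined by experienced medical physicists and radiation oncologists.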

## Key findings

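- ChatGPT-5 Pro achieved the highest average accuracy (94.7 ± 2.1%) and was the only model to exceed 90% on all three exams.
- All five LLMs averaged above 75% accuracy, ranging from 78.4% (Grok 4) to 94.7% (ChatGPT-5 Pro).
- The authors conclude that advanced LLMs, particularly ChatGPT-5 Pro, have strong potential for radiotherapy applications such as automated contouring and treatment planning support.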

## Abstract

Large language models (LLMs), such as ChatGPT and Grok, have rapidly advanced in natural language understanding and are increasingly being applied to specialized fields, including medicine. In this study, we evaluated the domain-specific knowledge of LLMs in radiotherapy by assessing their performance on three certification examinations in Japan: the Japanese Medical Physicist Examination, the Japanese Board Examination for Radiologists and the Japanese Board Examination for Radiation Oncologists. We assessed five LLMs—ChatGPT-5, ChatGPT-5 Pro, Grok 4, Grok 4 heavy and Gemini 2.5 Pro—by inputting all multiple-choice questions from these exams into each model and recording their responses. The AI-generated answers were compared with reference answers determined by experienced medical physicists and radiation oncologists. The results demonstrated average accuracies of 84.7 ± 2.0% (ChatGPT-5), 94.7 ± 2.1% (ChatGPT-5 Pro), 78.4 ± 1.2% (Grok 4), 81.6 ± 2.2% (Grok 4 heavy) and 88.9 ± 1.2% (Gemini 2.5 Pro). All models achieved over 75% accuracy, with ChatGPT-5 Pro consistently outperforming others, attaining an average accuracy exceeding 90% across all examinations. These findings highlight the strong potential of advanced LLMs, particularly ChatGPT-5 Pro, for future integration into radiotherapy-related applications such as automated contouring and treatment planning support.
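
The scoring described in the abstract (comparing each model's answers with reference answers, then reporting a mean ± spread) can be sketched as follows. This is a minimal illustration, not the authors' code. The exam names, answer keys and model answers are hypothetical toy data, the functions `accuracy` and `summarize` are illustrative names, and the sketch assumes exact-match scoring and that the reported ± value is a sample standard deviation across exams, which the abstract does not specify.

```python
from statistics import mean, stdev


def accuracy(model_answers, reference_answers):
    """Percentage of questions where the model's choice equals the reference.

    Assumption: one answer per question, scored by exact match.
    """
    if len(model_answers) != len(reference_answers):
        raise ValueError("answer lists must have the same length")
    if not reference_answers:
        raise ValueError("no questions to score")
    correct = sum(m == r for m, r in zip(model_answers, reference_answers))
    return 100.0 * correct / len(reference_answers)


def summarize(per_exam_accuracy):
    """Return (mean, sample standard deviation) over per-exam accuracies."""
    return mean(per_exam_accuracy), stdev(per_exam_accuracy)


# Hypothetical toy data: three exams with four questions each.
reference = {
    "exam_a": ["a", "c", "b", "e"],
    "exam_b": ["d", "a", "a", "b"],
    "exam_c": ["b", "b", "e", "c"],
}
model = {
    "exam_a": ["a", "c", "b", "d"],
    "exam_b": ["d", "a", "a", "b"],
    "exam_c": ["b", "e", "e", "a"],
}

per_exam = [accuracy(model[name], reference[name]) for name in reference]
avg, sd = summarize(per_exam)
print(f"{avg:.1f} ± {sd:.1f}%")
```

With real data, each list would hold one entry per exam question, and the summary would be computed once per model.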

## Full-text entities

- **Diseases:** stage III lung cancer (MESH:D008175), breast cancer (MESH:D001943), stage III disease (MESH:D007676)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12856024/full.md

---
Source: https://tomesphere.com/paper/PMC12856024