A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Peilong Wang; Jason Holmes; Zhengliang Liu; Dequan Chen; Tianming Liu; Jiajian Shen; Wei Liu

arXiv:2412.10622·physics.med-ph·July 22, 2025

TL;DR

This study evaluates how well recent large language models answer radiation oncology physics questions, showing that they perform at expert level and that explain-first, step-by-step prompting improves reasoning in some models.

Contribution

The paper provides an updated assessment of multiple recent LLMs on radiation oncology physics questions, highlighting their high performance and potential educational utility.

Findings

1. All models achieved expert-level accuracy.

2. Replacing correct answers with 'None of the above' reduced performance.

3. Explain-first prompts improved reasoning in some models.

Abstract

Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on recently released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by an experienced physicist, was used for this study. The answer options of the questions were randomly shuffled to create "new" exam sets. Five LLMs -- OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet -- in versions released before September 30, 2024, were queried using these new exam sets. To evaluate their deductive reasoning ability, the correct answer options in the questions were replaced with "None of the above." Then, explain-first and step-by-step instruction prompts were used to test whether this strategy improved their reasoning ability. The performance…
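The two question transformations described in the Methods (shuffling the answer options, then replacing the correct option with "None of the above") can be illustrated with a minimal Python sketch. This is not the authors' code: the `Question` class, the function names, and the choice to put "None of the above" in the same position as the removed correct option are all assumptions made for illustration, and the example question is a placeholder rather than an item from the exam set.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: list[str]
    correct_index: int  # position of the correct option in `options`


def shuffle_options(q: Question, rng: random.Random) -> Question:
    """Return a copy of `q` with its options in random order.

    The correct index is remapped so it still points at the same option text.
    """
    order = list(range(len(q.options)))
    rng.shuffle(order)
    new_options = [q.options[i] for i in order]
    return Question(q.stem, new_options, order.index(q.correct_index))


def replace_correct_with_nota(q: Question,
                              label: str = "None of the above") -> Question:
    """Return a copy of `q` whose correct option is overwritten by `label`.

    Assumption: the label takes the slot of the removed correct option, so
    `correct_index` is unchanged and now points at the label.
    """
    options = list(q.options)
    options[q.correct_index] = label
    return Question(q.stem, options, q.correct_index)


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the shuffle is reproducible
    original = Question(
        stem="Placeholder question stem",
        options=["option 1", "option 2", "option 3", "option 4"],
        correct_index=1,
    )
    shuffled = shuffle_options(original, rng)
    nota = replace_correct_with_nota(shuffled)
    print(shuffled.options, shuffled.correct_index)
    print(nota.options, nota.correct_index)
```

Both functions return new objects and leave the input untouched, so one source set of questions can yield several independently shuffled exam sets.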

Taxonomy

Methods: Sparse Evolutionary Training · LLaMA