Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology

Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, and Xiaofeng Yang

arXiv:2508.13192 · eess.IV · August 20, 2025

TL;DR

This paper evaluates GPT-5's zero-shot multimodal reasoning capabilities in radiology and radiation oncology, showing significant performance improvements over GPT-4o in medical image understanding and physics problem-solving.

Contribution

It provides the first comprehensive benchmarking of GPT-5's multimodal medical reasoning in high-stakes domains, demonstrating its superior accuracy over previous models.

Findings

01

GPT-5 outperforms GPT-4o across all datasets.

02

GPT-5 achieves 90.7% accuracy on physics questions, surpassing the human passing threshold.

03

Significant gains in challenging anatomical regions and domain-specific tasks.

Abstract

Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150…
