Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology
Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, and Xiaofeng Yang

TL;DR
This paper evaluates GPT-5's zero-shot multimodal reasoning capabilities in radiology and radiation oncology, showing significant performance improvements over GPT-4o in medical image understanding and physics problem-solving.
Contribution
It provides the first comprehensive benchmarking of GPT-5's multimodal medical reasoning in high-stakes domains, demonstrating higher accuracy than GPT-4o.
Findings
GPT-5 outperforms GPT-4o across all datasets.
GPT-5 achieves 90.7% accuracy on physics questions, surpassing the human passing threshold.
GPT-5 shows significant gains in challenging anatomical regions and on domain-specific tasks.
Abstract
Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150…
