Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space

Quoc-Huy Trinh; Xi Ding; Yang Liu; Zhenyue Qin; Xingjian Li; Gorkem Durak; Halil Ertugrul Aktas; Elif Keles; Ulas Bagci; Min Xu

arXiv:2603.13800·cs.CV·March 17, 2026

TL;DR

This paper introduces SpatialMed, a new benchmark dataset for 3D spatial reasoning in medical multimodal large language models, revealing current models' limitations in spatial understanding of medical images.

Contribution

The study develops an agentic pipeline for synthesizing 3D spatial VQA data and presents the first comprehensive benchmark for evaluating spatial reasoning in medical MLLMs.

Findings

01

Current models lack robust spatial reasoning in medical imaging.

02

SpatialMed contains nearly 10,000 question-answer pairs spanning multiple organs and tumor types.

03

Evaluations of 14 state-of-the-art MLLMs show significant gaps in spatial understanding of 3D medical images.

Abstract

Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical…
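
The abstract mentions computational tools such as volume and distance calculators. As a rough illustration of what such tools might compute, the sketch below derives an organ volume and a centroid-to-centroid distance from toy segmentation masks and turns the distance into a question-answer pair. This is not the authors' pipeline: every function name, the voxel-set mask representation, the (z, y, x) spacing convention, and the question template are assumptions made for illustration only.

```python
import math
from itertools import product


def box_mask(z_range, y_range, x_range):
    """Hypothetical toy mask: the set of (z, y, x) voxel indices inside a box."""
    return set(product(range(*z_range), range(*y_range), range(*x_range)))


def volume_ml(mask, spacing_mm):
    """Volume in millilitres: voxel count times the volume of one voxel."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return len(mask) * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


def centroid_mm(mask, spacing_mm):
    """Centroid in millimetres: mean voxel index per axis, scaled by spacing."""
    if not mask:
        raise ValueError("cannot compute the centroid of an empty mask")
    n = len(mask)
    return tuple(
        sum(voxel[axis] for voxel in mask) / n * spacing_mm[axis]
        for axis in range(3)
    )


def centroid_distance_mm(mask_a, mask_b, spacing_mm):
    """Euclidean distance between the two mask centroids, in millimetres."""
    return math.dist(centroid_mm(mask_a, spacing_mm), centroid_mm(mask_b, spacing_mm))


def make_distance_qa(name_a, name_b, mask_a, mask_b, spacing_mm):
    """Assumed template: wrap the computed distance in a question-answer pair."""
    distance = centroid_distance_mm(mask_a, mask_b, spacing_mm)
    return {
        "question": f"What is the distance between the centers of the {name_a} and the {name_b}?",
        "answer": f"{distance:.1f} mm",
    }


if __name__ == "__main__":
    spacing = (2.0, 1.0, 1.0)  # assumed (z, y, x) voxel spacing in mm
    organ = box_mask((2, 6), (2, 6), (2, 6))
    tumor = box_mask((2, 6), (2, 6), (12, 14))
    print(volume_ml(organ, spacing))
    print(make_distance_qa("organ", "tumor", organ, tumor, spacing))
```

Real CT or MRI volumes would normally be loaded as arrays with spacing read from the image header; the set-of-voxels form here only keeps the sketch dependency-free.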

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning