Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub

TL;DR
This paper introduces CT-SpatialVQA, a benchmark for evaluating the semantic-spatial reasoning abilities of 3D medical vision-language models using a large set of clinically grounded questions derived from CT scans and reports.
Contribution
The paper presents a new benchmark dataset and evaluation protocol for assessing spatial reasoning in 3D medical VLMs, revealing current models' limitations in this area.
Findings
Models show severe degradation on spatial reasoning tasks, averaging 34% accuracy.
Performance often falls below random chance, indicating poor spatial understanding.
The benchmark highlights the need for better integration of volumetric evidence in models.
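To make the "below random chance" finding concrete, here is a minimal sketch of how accuracy can be compared against a chance baseline. It assumes closed-ended questions with a fixed set of answer options, which this summary does not confirm. The QAItem record, its field names, and the toy data are hypothetical illustrations, not the benchmark's actual format or results.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAItem:
    """Hypothetical record for one closed-ended question (not the benchmark's schema)."""
    options: List[str]   # candidate answers shown to the model
    answer: str          # ground-truth answer
    prediction: str      # model's chosen answer


def accuracy_and_chance(items: List[QAItem]) -> Tuple[float, float]:
    """Return (model accuracy, expected accuracy of uniform random guessing)."""
    if not items:
        raise ValueError("items must be non-empty")
    n = len(items)
    # Fraction of questions where the prediction matches the ground truth.
    correct = sum(item.prediction == item.answer for item in items)
    # A uniform guesser is right with probability 1/k on a k-option question.
    chance = sum(1.0 / len(item.options) for item in items)
    return correct / n, chance / n


# Toy data for illustration only; these are not results from the paper.
toy_items = [
    QAItem(["left", "right"], "left", "right"),
    QAItem(["left", "right"], "right", "right"),
    QAItem(["yes", "no"], "yes", "no"),
    QAItem(["anterior", "posterior"], "posterior", "anterior"),
]

acc, chance = accuracy_and_chance(toy_items)
print(f"accuracy={acc:.2f}, chance={chance:.2f}, below_chance={acc < chance}")
```

A model that scores below the chance baseline on binary questions is doing worse than guessing, which is consistent with answers driven by priors that are systematically wrong for the scan at hand.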
Abstract
Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement…
