Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen

TL;DR
This paper introduces a new benchmark for multi-frame, spatially grounded reasoning in volumetric MRI, emphasizing transparent reasoning and spatial evidence in vision-language models.
Contribution
It presents SGMRI-VQA, a 41,307-pair annotated benchmark for clinical MRI reasoning, and demonstrates that targeted spatial supervision improves model grounding performance.
Findings
Fine-tuning with bounding box supervision improves grounding accuracy.
A benchmark of 10 VLMs shows the benefit of spatial supervision.
The dataset covers brain and knee MRI studies and is built from expert radiologist annotations in fastMRI+.
Abstract
Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about…
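To make "frame-indexed bounding box coordinates" concrete, the following is a minimal sketch of how one grounded QA record and a per-frame overlap score might look. Everything here is an assumption for illustration: the `GroundedQA` fields, the `(x_min, y_min, x_max, y_max)` box convention, the one-box-per-frame simplification, and the use of intersection-over-union (IoU) averaged over frames. The abstract does not specify the dataset schema or the grounding metric, so this should not be read as the benchmark's actual format or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedQA:
    """Hypothetical record for one QA pair (field names are illustrative)."""
    question: str
    answer: str
    reasoning: str          # chain-of-thought trace as free text
    boxes: Dict[int, Box]   # frame (slice) index -> one box; a simplification


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_frame_iou(pred: Dict[int, Box], gold: Dict[int, Box]) -> float:
    """Average IoU over every frame named by either side.

    A frame that appears in only one of the two mappings contributes 0,
    so boxes placed on the wrong slice are penalized as well as missed ones.
    """
    frames = set(pred) | set(gold)
    if not frames:
        return 0.0
    total = 0.0
    for frame in frames:
        if frame in pred and frame in gold:
            total += box_iou(pred[frame], gold[frame])
    return total / len(frames)


if __name__ == "__main__":
    gold = GroundedQA(
        question="Is there a lesion, and where?",
        answer="Yes",
        reasoning="Finding visible on two adjacent slices.",
        boxes={10: (0, 0, 10, 10), 11: (0, 0, 10, 10)},
    )
    # Prediction matches slice 10 exactly but places the second box on slice 12.
    predicted = {10: (0, 0, 10, 10), 12: (0, 0, 10, 10)}
    print(mean_frame_iou(predicted, gold.boxes))
```

Scoring over the union of frames is one design choice among several; it reflects the abstract's point that findings may span multiple frames or appear on only a few slices, so choosing the right slice matters as much as placing the box within it.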
