Medical Image Spatial Grounding with Semantic Sampling
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary

TL;DR
This paper introduces MIS-Ground, a comprehensive benchmark for evaluating medical image spatial grounding in vision language models, and proposes MIS-SemSam, a semantic sampling method that enhances grounding accuracy.
Contribution
The study presents MIS-Ground, a new benchmark for assessing VLMs on medical image spatial grounding, and MIS-SemSam, a model-agnostic optimization technique that improves grounding accuracy (by 13.06% for Qwen3-VL-32B).
Findings
MIS-SemSam improves Qwen3-VL-32B accuracy by 13.06%.
Visual and textual prompting schemes such as labels, bounding boxes, and mask overlays have varying effects on VLM spatial grounding performance.
MIS-Ground enables reproducible evaluation of VLM vulnerabilities.
Abstract
Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that…
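As an illustration of the visual prompts named in the abstract, the sketch below draws a bounding box and a mask overlay on a synthetic 2D slice using NumPy. It is only a minimal sketch and not the authors' pipeline: the function names (`draw_bounding_box`, `overlay_mask`, `mask_to_box`), the colors, the blend factor `alpha`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest box around a boolean mask: (row_min, col_min, row_max, col_max), inclusive."""
    rows, cols = np.nonzero(mask)
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())

def draw_bounding_box(slice_rgb, box, color=(255, 0, 0)):
    """Return a copy of an RGB slice with a 1-pixel box outline (hypothetical helper)."""
    out = slice_rgb.copy()
    r0, c0, r1, c1 = box
    out[r0, c0:c1 + 1] = color   # top edge
    out[r1, c0:c1 + 1] = color   # bottom edge
    out[r0:r1 + 1, c0] = color   # left edge
    out[r0:r1 + 1, c1] = color   # right edge
    return out

def overlay_mask(slice_rgb, mask, color=(0, 255, 0), alpha=0.5):
    """Return a copy of an RGB slice with `color` alpha-blended over the masked pixels."""
    out = slice_rgb.astype(np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.round().astype(np.uint8)

# Synthetic stand-in for one grayscale slice of a 3D volume (assumed data).
gray = np.full((64, 64), 100, dtype=np.uint8)
slice_rgb = np.stack([gray] * 3, axis=-1)

# Synthetic stand-in for a segmentation mask of one structure.
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 15:40] = True

box = mask_to_box(mask)
boxed = draw_bounding_box(slice_rgb, box)
masked = overlay_mask(slice_rgb, mask)
```

Either `boxed` or `masked` could then be passed to a VLM alongside a textual prompt. Which rendering helps a given model is the empirical question the benchmark is described as measuring.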
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education
