MedSG-Bench: A Benchmark for Medical Image Sequences Grounding
Jingkun Yue, Siqi Zhang, Zinan Jia, Huihuan Xu, Zongbo Han, Xiaohong Liu, Guangyu Wang

TL;DR
MedSG-Bench is the first benchmark tailored for medical image sequences grounding. It addresses the underrepresentation of sequential images in medical visual grounding benchmarks and comes with a dataset, tasks, and a model to advance research in this area.
Contribution
The paper presents MedSG-Bench, a benchmark with datasets and tasks for medical image sequences grounding, and introduces MedSeq-Grounder, a model specialized for this setting.
Findings
Existing models show limitations in sequential medical grounding tasks.
The MedSG-188K dataset supports large-scale instruction tuning.
MedSeq-Grounder enhances understanding across sequential medical images.
Abstract
Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Focus
