SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Joongwon Chae, Zhenyu Wang, Peiwu Qin

TL;DR
This paper presents SJTU, a novel framework that enhances multimodal models with spatial coordinate understanding to achieve unified image segmentation guided by natural language, improving localization accuracy and operational efficiency.
Contribution
Introduces a spatial coordinate detection framework that bridges vision-language interaction and segmentation, enabling precise target localization in multimodal models.
Findings
Achieves IoU of 0.5958 on COCO 2017
Achieves IoU of 0.6758 on Pascal VOC
Inference time of 7 seconds per image on RTX 3090
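For readers unfamiliar with the metric behind the first two findings, the following is a minimal sketch of intersection over union (IoU) between a predicted and a ground-truth binary mask. It is an illustration only: the function name `mask_iou`, the nested-list mask representation, and the choice to return 1.0 when both masks are empty are assumptions, and the summary does not state how the paper aggregates IoU across images or classes.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks.

    Masks are equally shaped nested sequences of 0/1 (or False/True) values.
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        if len(row_p) != len(row_t):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(row_p, row_t):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel set in both masks
            union += p or t          # pixel set in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels out of 4 covered -> 0.5
```

Under this definition, a reported IoU of 0.5958 would mean that the overlap between prediction and ground truth covers roughly 60% of their combined area, assuming the score is an average of per-sample values.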
Abstract
Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By…
Taxonomy
Topics: Geographic Information Systems Studies · Speech and dialogue systems · Spatial Cognition and Navigation
