SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection

Joongwon Chae; Zhenyu Wang; Peiwu Qin

arXiv:2412.02565 · cs.CV · January 6, 2026

TL;DR

This paper presents SJTU, a novel framework that enhances multimodal models with spatial coordinate understanding to achieve unified image segmentation guided by natural language, improving localization accuracy and operational efficiency.

Contribution

Introduces a spatial coordinate detection framework that bridges vision-language interaction and segmentation, enabling precise target localization in multimodal models.

Findings

01 Achieves IoU of 0.5958 on COCO 2017
02 Achieves IoU of 0.6758 on Pascal VOC
03 Inference time of 7 seconds per image on RTX 3090
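
For readers unfamiliar with the metric, the IoU (intersection over union) figures above can be read through a minimal sketch like the one below. It assumes binary segmentation masks stored as NumPy arrays. This page does not say whether the reported scores are computed on masks or boxes, or how they are averaged, so `mask_iou` and the toy arrays are illustrative only and are not the authors' evaluation code.

```python
import numpy as np


def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two binary masks with the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy example: two 2x2 squares on a 4x4 grid, overlapping in one column.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True      # 4 predicted pixels
target = np.zeros((4, 4), dtype=bool)
target[0:2, 1:3] = True    # 4 ground-truth pixels, 2 shared with pred

print(mask_iou(pred, target))  # 2 shared pixels / 6 pixels in the union
```

A dataset-level score such as 0.5958 would typically come from averaging a per-image or per-object value like this one, though the averaging scheme is not stated here.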

Abstract

Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By…
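
To make the coordinate-detection idea concrete, here is a small sketch of one way a textual model reply could be turned into a pixel bounding box that a downstream segmentation step might consume. Everything in it is an assumption made for illustration: the `[x1, y1, x2, y2]` reply format, the `parse_box` helper, and the clamping policy are hypothetical and are not taken from the paper, whose abstract is truncated here.

```python
import re
from typing import Optional, Tuple

# Hypothetical reply format: four numbers in square brackets, "[x1, y1, x2, y2]".
_NUM = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_box(reply: str, width: int, height: int) -> Optional[Tuple[int, int, int, int]]:
    """Extract the first bracketed box from a reply and clamp it to the image."""
    match = BOX_PATTERN.search(reply)
    if match is None:
        return None  # the reply contained no box in the assumed format
    x1, y1, x2, y2 = (float(group) for group in match.groups())
    # Order the corners so that (x1, y1) is top-left and (x2, y2) is bottom-right.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))

    def clamp(value: float, upper: int) -> int:
        return int(min(max(round(value), 0), upper))

    return (
        clamp(x1, width - 1),
        clamp(y1, height - 1),
        clamp(x2, width - 1),
        clamp(y2, height - 1),
    )


print(parse_box("The dog is at [12, 30, 700, 200].", width=640, height=480))
```

Clamping to the image bounds is a defensive choice for the case where a model emits coordinates slightly outside the frame; a real pipeline might prefer to reject such replies instead.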

Code & Models

Repositories

jw-chae/sjtu (PyTorch, official)

Taxonomy

Topics: Geographic Information Systems Studies · Speech and dialogue systems · Spatial Cognition and Navigation