SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

Yi Zhang; Youya Xia; Yong Wang; Meng Song; Xin Wu; Wenjun Wan; Bingbing Liu; AiXue Ye; Hongbo Zhang; Feng Wen

arXiv:2603.00409·cs.CV·March 3, 2026

TL;DR

SSR introduces a novel framework for structured scene reasoning that effectively integrates 2D and 3D representations, enabling advanced spatial reasoning and achieving state-of-the-art results on spatial intelligence benchmarks.

Contribution

The paper presents a lightweight alignment mechanism and a scene graph generation pipeline that enhance spatial reasoning without extensive pre-training or large-scale alignment.

Findings

01 Achieves 73.9 on VSI-Bench, outperforming larger models.

02 Effectively integrates 2D and 3D data for spatial reasoning.

03 Demonstrates state-of-the-art performance on multiple benchmarks.

Abstract

While Multimodal Large Language Models (MLLMs) excel at semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from high modality-alignment costs and insufficient precision in fine-grained structural modeling. We introduce SSR, a framework for Structured Scene Reasoning that integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, the framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, removing the need for large-scale alignment pre-training. To support complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by…
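
The abstract names two fusion operations, cross-modal addition and token interleaving, without giving their details here. The following is a minimal sketch of what each operation could look like on toy data, not the authors' implementation. The function names, the one-to-one pairing of 2D tokens with 3D features, and the equal feature dimension are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def cross_modal_add(tokens_2d: List[Vector], feats_3d: List[Vector]) -> List[Vector]:
    """Element-wise sum of each 2D token with its paired 3D feature.

    Assumption: tokens are paired one-to-one and share the same dimension.
    """
    if len(tokens_2d) != len(feats_3d):
        raise ValueError("expected one 3D feature per 2D token")
    return [[a + b for a, b in zip(t, g)] for t, g in zip(tokens_2d, feats_3d)]


def interleave(tokens_2d: List[Vector], feats_3d: List[Vector]) -> List[Vector]:
    """Alternate 2D and 3D tokens in one sequence: [t0, g0, t1, g1, ...].

    Assumption: strict alternation; the paper may order tokens differently.
    """
    if len(tokens_2d) != len(feats_3d):
        raise ValueError("expected one 3D feature per 2D token")
    sequence: List[Vector] = []
    for t, g in zip(tokens_2d, feats_3d):
        sequence.append(t)
        sequence.append(g)
    return sequence


# Toy inputs: two tokens, feature dimension 2 (hypothetical values).
tokens_2d = [[1.0, 0.0], [0.0, 1.0]]
feats_3d = [[0.5, 0.5], [0.25, 0.75]]

fused = cross_modal_add(tokens_2d, feats_3d)   # same length as the input
mixed = interleave(tokens_2d, feats_3d)        # twice the input length
```

Under these assumptions, addition keeps the sequence length unchanged and places geometry in the same embedding space as the 2D tokens, while interleaving doubles the sequence length and leaves each modality's tokens unmodified.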

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization