SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen

TL;DR
SSR introduces a novel framework for structured scene reasoning that effectively integrates 2D and 3D representations, enabling advanced spatial reasoning and achieving state-of-the-art results on spatial intelligence benchmarks.
Contribution
The paper presents a lightweight alignment mechanism and a scene graph generation pipeline that enhance spatial reasoning without extensive pre-training or large-scale alignment.
Findings
Achieves 73.9 on VSI-Bench, outperforming larger models.
Effectively integrates 2D and 3D data for spatial reasoning.
Demonstrates state-of-the-art performance on multiple benchmarks.
Abstract
While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling precision. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by…
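As a rough illustration of the two operations the abstract names, cross-modal addition and token interleaving, the sketch below works on plain Python lists. It is not the authors' implementation: the function names `cross_modal_add` and `interleave`, the one-to-one pairing of 2D and 3D tokens, and the assumption that the 3D features have already been projected to the same width as the 2D visual tokens are all hypothetical choices made for the example.

```python
from typing import List

Token = List[float]  # one token embedding (assumed layout, for illustration only)


def cross_modal_add(vis_tokens: List[Token], geo_tokens: List[Token]) -> List[Token]:
    """Element-wise sum of each 2D visual token with its paired 3D feature.

    Assumes both inputs hold the same number of tokens with the same width,
    i.e. the 3D features were already projected into the visual token space.
    """
    if len(vis_tokens) != len(geo_tokens):
        raise ValueError("token counts must match")
    fused = []
    for vis, geo in zip(vis_tokens, geo_tokens):
        if len(vis) != len(geo):
            raise ValueError("token widths must match")
        fused.append([v + g for v, g in zip(vis, geo)])
    return fused


def interleave(first: List[Token], second: List[Token]) -> List[Token]:
    """Alternate tokens from two equal-length sequences: f0, s0, f1, s1, ..."""
    if len(first) != len(second):
        raise ValueError("sequence lengths must match")
    out: List[Token] = []
    for f, s in zip(first, second):
        out.append(f)
        out.append(s)
    return out


if __name__ == "__main__":
    vis = [[1.0, 2.0], [3.0, 4.0]]    # toy 2D visual tokens
    geo = [[0.5, -1.0], [0.0, 1.0]]   # toy 3D features, same width (assumed)
    print(cross_modal_add(vis, geo))  # [[1.5, 1.0], [3.0, 5.0]]
    print(interleave(vis, geo))       # [[1.0, 2.0], [0.5, -1.0], [3.0, 4.0], [0.0, 1.0]]
```

Addition leaves the sequence length unchanged, while interleaving doubles it. How SSR combines or weights the two is not stated in the text available here.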
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
