Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

Chen Yang; Guanxin Lin; Youquan He; Peiyao Chen; Guanghe Liu; Yufan Mo; Zhouyuan Xu; Linhao Wang; Guohui Zhang; Zihang Zhang; Shenxiang Zeng; Chen Wang; Jiansheng Fan

arXiv:2602.07864·cs.CV·February 10, 2026

TL;DR

This paper introduces SSI-Bench, a challenging VQA benchmark for spatial reasoning on constrained 3D structures, revealing significant gaps between current models and human performance.

Contribution

The paper presents SSI-Bench, a benchmark built from complex real-world 3D structures for evaluating spatial reasoning in vision-language models, created through a fully human-centered pipeline.

Findings

01. Current models perform poorly compared to humans on SSI-Bench.

02. Models show limited improvement even when encouraged to think.

03. Error analysis highlights failures in structural grounding and 3D reasoning.

Abstract

Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely…

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Robotics and Sensor-Based Localization