TL;DR
ArchSIBench is a comprehensive benchmark designed to evaluate the architectural spatial intelligence of vision-language models across perception, reasoning, navigation, transformation, and configuration tasks.
Contribution
This work introduces a new benchmark with expert-annotated questions to systematically assess architectural spatial understanding in VLMs, highlighting current capabilities and gaps.
Findings
The performance of most evaluated models differs significantly from human baselines.
Some models approach human performance without architectural training.
A gap remains in spatial transformation and configuration reasoning.
Abstract
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs), such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17…
