ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Qirui Shen; Wenda Wang; Jiachen Lu; Zilong Huang; Jin Bai; Lei He; Hongxuan Chen; Weixin Huang

arXiv:2605.20837 · cs.CV · May 21, 2026

TL;DR

ArchSIBench is a comprehensive benchmark designed to evaluate the architectural spatial intelligence of vision-language models across perception, reasoning, navigation, transformation, and configuration tasks.

Contribution

This work introduces a new benchmark with expert-annotated questions to systematically assess architectural spatial understanding in VLMs, highlighting current capabilities and gaps.

Findings

01. Most models differ significantly from human baselines.

02. Some models approach human performance without architectural training.

03. A gap remains in spatial transformation and configuration reasoning.
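
The comparison behind these findings can be pictured as per-dimension accuracy set against a human baseline. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields (`dimension`, `answer`, `prediction`), both function names, and the toy records are hypothetical, and exact-match scoring is assumed rather than taken from the paper. Only the five dimension names come from the summary above.

```python
from collections import defaultdict

# The five dimensions named in the paper summary.
DIMENSIONS = ("perception", "reasoning", "navigation", "transformation", "configuration")


def accuracy_by_dimension(records):
    """Return {dimension: fraction correct} for the dimensions present in records.

    Each record is a hypothetical dict with keys "dimension", "answer",
    and "prediction"; a prediction counts as correct on exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        total[dim] += 1
        correct[dim] += int(rec["prediction"] == rec["answer"])
    return {dim: correct[dim] / total[dim] for dim in total}


def gap_to_human(model_acc, human_acc):
    """Human accuracy minus model accuracy, for dimensions scored in both."""
    return {dim: human_acc[dim] - model_acc[dim] for dim in model_acc if dim in human_acc}


# Toy records invented for illustration only; these are not benchmark data.
toy = [
    {"dimension": "perception", "answer": "A", "prediction": "A"},
    {"dimension": "perception", "answer": "B", "prediction": "B"},
    {"dimension": "transformation", "answer": "C", "prediction": "A"},
    {"dimension": "transformation", "answer": "D", "prediction": "D"},
]
model_acc = accuracy_by_dimension(toy)
```

Passing `model_acc` and a dictionary of human accuracies to `gap_to_human` gives one number per dimension, which makes it easy to see where a model trails humans most.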

Abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated basic spatial skills of Vision-Language Models (VLMs), such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17…

Code & Models

Repositories

https://huggingface.co/datasets/ArchSIBench/ArchSIBench
