CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments

Haotian Xu; Yue Hu; Zhengqiu Zhu; Chen Gao; Ziyou Wang; Junreng Rao; Wenhao Lu; Weishi Li; Quanjun Yin; Yong Li

arXiv:2601.14339 · cs.CV · January 22, 2026

TL;DR

CityCube is a new benchmark designed to evaluate vision-language models' ability to perform cross-view spatial reasoning in complex urban environments, revealing significant gaps between current models and human performance.

Contribution

The paper introduces an urban-focused benchmark with viewpoints from vehicles, drones, and satellites and 5,022 annotated multi-view QA pairs, addressing the tendency of existing spatial reasoning benchmarks to concentrate on indoor or street settings.

Findings

01 Current VLMs perform significantly worse than humans in urban spatial reasoning.

02 Small-scale fine-tuned VLMs outperform large-scale models on this benchmark.

03 There is a fundamental cognitive gap between VLMs and human spatial reasoning.

Abstract

Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs…
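Because the QA pairs are grouped into cognitive dimensions, a benchmark of this kind is usually scored per category rather than with a single overall number. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. It assumes a multiple-choice answer format, and the `QAItem` fields and the `dim_1`/`dim_2` labels are hypothetical placeholders, since this summary does not give the paper's data schema or the names of the dimensions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    # All field names are hypothetical; the benchmark's real schema is not shown here.
    dimension: str   # placeholder label for one of the cognitive dimensions
    answer: str      # ground-truth option letter, e.g. "A" (multiple choice is assumed)
    prediction: str  # option letter returned by the model under test


def accuracy_by_dimension(items):
    """Return {dimension: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.dimension] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if item.prediction.strip().upper() == item.answer.strip().upper():
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data for illustration only; these are not benchmark results.
items = [
    QAItem("dim_1", "A", "A"),
    QAItem("dim_1", "B", "C"),
    QAItem("dim_2", "D", "d"),
]
scores = accuracy_by_dimension(items)
```

Reporting one score per dimension keeps a weakness in one category from being hidden by strong results in another; the same grouping could be repeated over the spatial relation expressions.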

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Constraint Satisfaction and Optimization