SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR
This paper introduces SpatialScore, a comprehensive benchmark for evaluating multimodal large language models' spatial intelligence, along with datasets and a multi-agent system to improve spatial reasoning capabilities.
Contribution
It presents the most diverse spatial intelligence benchmark to date, evaluates models extensively, and offers new datasets and a multi-agent system to enhance spatial reasoning without additional training.
Findings
49 models evaluated revealing persistent challenges
SpatialCorpus improves model performance significantly
SpatialAgent enhances reasoning without extra training
Abstract
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
