HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang

TL;DR
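HiSpatial introduces a four-level hierarchical framework and a large, automatically generated 3D spatial VQA dataset (built from approximately 5M images with over 45M objects) to improve 3D spatial understanding in vision-language models, achieving state-of-the-art results on spatial reasoning tasks.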
Contribution
The paper presents a four-level hierarchical framework for learning 3D spatial understanding, an automated pipeline that generates 3D spatial VQA pairs for supervised fine-tuning of VLMs, and an RGB-D VLM that takes metric-scale point maps as auxiliary inputs.
Findings
Achieves state-of-the-art performance on spatial reasoning benchmarks.
Demonstrates the effectiveness of hierarchical task design.
Surpasses specialized spatial models and large proprietary systems.
Abstract
Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on…
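To make the notion of a "metric-scale point map" concrete, the sketch below back-projects a metric depth map into per-pixel 3D coordinates with a standard pinhole camera model, then derives one distance question from two object masks. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the intrinsics (fx, fy, cx, cy), the mask inputs, and the question template are all hypothetical, and the abstract does not say how the point maps or VQA pairs are actually produced.

```python
import numpy as np


def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W), in meters, to an (H, W, 3) point map.

    Assumes a pinhole camera with focal lengths fx, fy and principal point
    (cx, cy), all in pixels. These intrinsics are assumed inputs.
    """
    h, w = depth.shape
    # u is the column index and v is the row index of each pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def distance_qa(points, mask_a, mask_b, name_a, name_b):
    """Hypothetical VQA template: distance between two object centroids.

    mask_a and mask_b are boolean (H, W) arrays selecting each object's pixels.
    """
    centroid_a = points[mask_a].mean(axis=0)
    centroid_b = points[mask_b].mean(axis=0)
    dist = float(np.linalg.norm(centroid_a - centroid_b))
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"About {dist:.2f} meters."
    return question, answer, dist
```

Because the depth is metric, the resulting distances are in meters, which is what allows a template like this to yield quantitative answers rather than only relative ones.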
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
