HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Huizhi Liang; Yichao Shen; Yu Deng; Sicheng Xu; Zhiyuan Feng; Tong Zhang; Yaobo Liang; Jiaolong Yang

arXiv:2603.25411·cs.CV·March 27, 2026

TL;DR

HiSpatial introduces a four-level hierarchical framework and a large-scale 3D spatial VQA dataset, built from approximately 5M images, to improve 3D spatial understanding in vision-language models, achieving state-of-the-art results on spatial reasoning benchmarks.

Contribution

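The paper presents a hierarchical framework that splits 3D spatial understanding into four levels, from geometric perception to abstract spatial reasoning; an automated pipeline that generates 3D spatial VQA pairs for supervised fine-tuning of VLMs; and an RGB-D VLM that takes metric-scale point maps as auxiliary input.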

Findings

1. Achieves state-of-the-art performance on spatial reasoning benchmarks.

2. Demonstrates the effectiveness of hierarchical task design.

3. Surpasses specialized spatial models and large proprietary systems.

Abstract

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on…
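To make "metric-scale point map" concrete, here is a minimal sketch of the standard pinhole back-projection that turns a metric depth map into a per-pixel grid of 3D points. It illustrates the general technique only. The abstract does not say how the authors obtain their point maps, and the function name `depth_to_point_map` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_point_map(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[Point]]:
    """Back-project a metric depth map into a per-pixel 3D point map.

    Hypothetical helper for illustration (not taken from the paper).
    depth[v][u] is the depth in meters at pixel column u, row v.
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    Returns a grid of (x, y, z) camera-frame points in meters.
    """
    point_map: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            # Pinhole model: u = fx * x / z + cx, so x = (u - cx) * z / fx.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Toy 2x2 depth map with made-up intrinsics, for demonstration only.
toy_depth = [[2.0, 2.0], [4.0, 4.0]]
toy_points = depth_to_point_map(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(toy_points[0][0])  # (-0.5, -0.5, 2.0)
print(toy_points[1][1])  # (1.0, 1.0, 4.0)
```

Because each pixel carries its own (x, y, z) in meters, a point map of this kind has the same height and width as the RGB image, which is what lets it serve as an auxiliary input aligned with the image.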

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications