SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Jian Zhang; Shijie Zhou; Bangya Liu; Achuta Kadambi; Zhiwen Fan

arXiv:2603.27437·cs.CV·May 5, 2026

TL;DR

SpatialStack is a hierarchical fusion framework that improves 3D spatial reasoning in vision-language models by aligning multi-level geometric features with the language backbone, and it reports state-of-the-art results on multiple 3D spatial reasoning benchmarks.

Contribution

The paper proposes a multi-level fusion approach that synchronizes geometric and language representations across the model hierarchy, instead of fusing only deep-layer encoder features, to improve 3D spatial understanding.

Findings

01 Achieves state-of-the-art results on multiple 3D spatial reasoning benchmarks.

02 Multi-level fusion consistently improves 3D understanding across tasks.

03 Demonstrates robustness and generalization in diverse spatial reasoning scenarios.

Abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local…
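
To make the contrast in the abstract concrete, the following is a minimal toy sketch of late-stage fusion versus layered fusion. It is not the authors' implementation: the abstract does not specify how geometric features are injected, so the additive injection into visual-token hidden states, the linear projections, the layer schedule `inject_at`, the per-token `toy_layer`, and every name and size below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(h, w):
    """Stand-in for one language-backbone block (per-token residual MLP, no attention)."""
    return h + np.tanh(h @ w)

def late_fusion(tokens, geo_levels, proj, layer_ws, vis_slice):
    """Baseline: only the deepest geometry level is fused, once, at the input."""
    h = tokens.copy()
    h[vis_slice] += geo_levels[-1] @ proj[-1]
    for w in layer_ws:
        h = toy_layer(h, w)
    return h

def layered_fusion(tokens, geo_levels, proj, layer_ws, vis_slice, inject_at):
    """Assumed layered variant: geometry level k is added before backbone layer inject_at[k]."""
    h = tokens.copy()
    schedule = {layer: k for k, layer in enumerate(inject_at)}
    for i, w in enumerate(layer_ws):
        if i in schedule:
            k = schedule[i]
            # Project level-k geometry features to the model width and add them
            # to the hidden states at the visual-token positions only.
            h[vis_slice] += geo_levels[k] @ proj[k]
        h = toy_layer(h, w)
    return h

# Toy sizes (hypothetical): 4 visual tokens, 3 text tokens, 3 geometry levels.
n_vis, n_txt, d_model, d_geo, n_layers = 4, 3, 8, 6, 6
tokens = rng.normal(size=(n_vis + n_txt, d_model))
vis_slice = slice(0, n_vis)
geo_levels = [rng.normal(size=(n_vis, d_geo)) for _ in range(3)]
proj = [rng.normal(size=(d_geo, d_model)) * 0.1 for _ in range(3)]
layer_ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_layers)]

out_late = late_fusion(tokens, geo_levels, proj, layer_ws, vis_slice)
out_layered = layered_fusion(tokens, geo_levels, proj, layer_ws, vis_slice,
                             inject_at=[0, 2, 4])
print(out_late.shape, out_layered.shape)
```

In this sketch, late fusion is the special case of layered fusion with a single level injected before layer 0. Because the toy layer has no attention, text-token states are unaffected here; in a real transformer the injected geometry would reach text tokens through attention over the visual tokens.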

Code & Models

Models

🤗 Journey9ni/SpatialStack-Qwen3.5-4B
model · 339 downloads
