TL;DR
SpatialStack is a hierarchical fusion framework that improves 3D spatial reasoning in vision-language models by aligning multi-level geometric features with vision and language representations. It reports state-of-the-art results on multiple 3D spatial reasoning benchmarks.
Contribution
The paper proposes a multi-level fusion approach that synchronizes geometric and language representations across the model hierarchy, instead of fusing only the deep-layer features of the vision and geometry encoders, to improve 3D spatial understanding.
Findings
Achieves state-of-the-art results on multiple 3D spatial reasoning benchmarks.
Multi-level fusion consistently improves 3D understanding across tasks.
Demonstrates robustness and generalization in diverse spatial reasoning scenarios.
Abstract
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local…
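The contrast the abstract draws between late-stage fusion and hierarchical fusion can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the projector, the fusion operator, or which layers receive which geometry level, so the additive injection, the one-to-one token alignment, and every name below (`make_layer`, `late_fusion`, `hierarchical_fusion`, `inject_at`) are assumptions made for illustration only.

```python
import numpy as np


def make_layer(weight):
    """Toy stand-in for one language-backbone layer (residual + tanh)."""
    def layer(hidden):
        return hidden + np.tanh(hidden @ weight)
    return layer


def late_fusion(hidden, geo_levels, weights, layers):
    """Baseline: fuse only the deepest geometry level, once, before layer 0."""
    hidden = hidden + geo_levels[-1] @ weights[-1]
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def hierarchical_fusion(hidden, geo_levels, weights, layers, inject_at):
    """Inject geometry level k into the hidden states entering layer inject_at[k].

    Assumption: fusion is a projected addition and geometry tokens are
    aligned one-to-one with the hidden-state tokens.
    """
    if len(inject_at) != len(geo_levels):
        raise ValueError("need one injection layer per geometry level")
    level_for_layer = {layer_idx: k for k, layer_idx in enumerate(inject_at)}
    for i, layer in enumerate(layers):
        if i in level_for_layer:
            k = level_for_layer[i]
            # Project the geometry features to the backbone width, then add.
            hidden = hidden + geo_levels[k] @ weights[k]
        hidden = layer(hidden)
    return hidden


rng = np.random.default_rng(0)
num_tokens, geo_dim, hidden_dim = 4, 6, 8

layers = [
    make_layer(rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)))
    for _ in range(4)
]
# Shallow-to-deep features from a hypothetical geometry encoder.
geo_levels = [rng.normal(size=(num_tokens, geo_dim)) for _ in range(3)]
weights = [rng.normal(scale=0.1, size=(geo_dim, hidden_dim)) for _ in range(3)]
hidden = rng.normal(size=(num_tokens, hidden_dim))

late = late_fusion(hidden, geo_levels, weights, layers)
fused = hierarchical_fusion(hidden, geo_levels, weights, layers, inject_at=[0, 1, 2])
print(late.shape, fused.shape)
```

In this sketch the baseline sees only the deepest geometry level, whereas the hierarchical variant exposes shallow, middle, and deep levels to successive backbone layers, which is the structural difference the abstract describes.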
