Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu; Fangneng Zhan; Kaichen Zhou; Yilun Du; Paul Pu Liang; Hanspeter Pfister

arXiv:2511.10946 · cs.CV · April 16, 2026

TL;DR

SandboxVLM is a 3D abstraction framework that improves the spatial reasoning of vision-language models in zero-shot settings, without additional training.

Contribution

The paper presents SandboxVLM, a framework that uses abstract bounding boxes to encode geometric structure and physical kinematics for VLMs, bridging the modality gap between 3D tasks and the 2D data VLMs are trained on.

Findings

01

Achieves an 8.3% improvement on the SAT Real benchmark.

02

Consistently improves spatial intelligence across multiple benchmarks and VLM backbones in zero-shot evaluation.

03

Enhances 3D reasoning without additional training.

Abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence,…
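
The abstract names the pipeline stages but not their internals, so the following is only an illustrative sketch of what "multi-view voting" followed by an abstract bounding box could look like. It is not the authors' implementation. Everything in it is an assumption made for illustration: the voxel-grid voting scheme, the function names `vote_voxels`, `abstract_box`, and `describe`, and the `voxel` and `min_views` parameters. It assumes each view has already produced 3D points for one object (the output of a step such as proxy elevation), keeps only the regions that several views agree on, and turns them into a coarse box that can be written into a text prompt. The clustering step is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]
Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def vote_voxels(views: List[Iterable[Point]],
                voxel: float = 0.5,
                min_views: int = 2) -> List[Voxel]:
    """Keep voxels that at least `min_views` distinct views place a point in.

    Hypothetical voting rule: a voxel is trusted only when several views agree.
    """
    seen: Dict[Voxel, Set[int]] = defaultdict(set)
    for view_id, points in enumerate(views):
        for x, y, z in points:
            key = (int(x // voxel), int(y // voxel), int(z // voxel))
            seen[key].add(view_id)
    return sorted(k for k, ids in seen.items() if len(ids) >= min_views)


def abstract_box(voxels: List[Voxel], voxel: float = 0.5) -> Box:
    """Axis-aligned box (min corner, max corner) enclosing the kept voxels."""
    if not voxels:
        raise ValueError("no voxel received enough votes")
    lo = tuple(min(v[i] for v in voxels) * voxel for i in range(3))
    hi = tuple((max(v[i] for v in voxels) + 1) * voxel for i in range(3))
    return lo, hi


def describe(label: str, box: Box) -> str:
    """Serialize the box as plain text that could be appended to a VLM prompt."""
    (x0, y0, z0), (x1, y1, z1) = box
    return (f"{label}: box min=({x0:.1f}, {y0:.1f}, {z0:.1f}) "
            f"max=({x1:.1f}, {y1:.1f}, {z1:.1f})")


# Two views of the same object; the point at (3, 3, 3) appears in one view only
# and is therefore voted out as an outlier.
views = [
    [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (3.0, 3.0, 3.0)],
    [(0.2, 0.2, 0.2), (0.7, 0.2, 0.2)],
]
kept = vote_voxels(views)
print(describe("cup", abstract_box(kept)))
```

The design choice worth noting is that the model receives a handful of box coordinates rather than a dense reconstruction, which matches the abstract's emphasis on abstraction; how the paper actually votes, clusters, and formats its boxes may differ.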
