3D Primitives are a Spatial Language for VLMs
Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

TL;DR
This paper shows that using 3D geometric primitives as an intermediate representation improves spatial reasoning in vision-language models, and it introduces a new benchmark, an inference strategy, and a self-supervised fine-tuning method.
Contribution
It introduces the SpatialBabel benchmark, the Code-CoT inference strategy, and S$^{3}$-FT self-supervised fine-tuning, advancing primitive-based spatial understanding in VLMs.
Findings
VLMs' object detection varies significantly across scene-code languages.
Code-CoT improves spatial reasoning accuracy by up to 6.4%.
S$^{3}$-FT enhances model performance without human labels or teacher models.
Abstract
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to across languages. Second, we propose Code-CoT (Code…
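To make the idea of a primitive-based scene and an object-detection F1 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `Primitive` fields and the `detection_f1` function are hypothetical names, and the score matches objects by class label and count only, which may differ from the matching rule SpatialBabel actually uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Primitive:
    """One object in a scene, approximated by a simple 3D shape (hypothetical schema)."""
    label: str                           # object class, e.g. "chair"
    shape: str                           # "cube", "sphere", or "cylinder"
    position: Tuple[float, float, float]
    size: Tuple[float, float, float]


def detection_f1(predicted: List[Primitive], reference: List[Primitive]) -> float:
    """Class-level F1: a prediction counts as correct when its label matches
    a not-yet-matched reference object. Positions are ignored (an assumption)."""
    pred_counts = Counter(p.label for p in predicted)
    ref_counts = Counter(p.label for p in reference)
    # Multiset intersection gives the number of label-matched objects.
    true_positives = sum((pred_counts & ref_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


reference_scene = [
    Primitive("chair", "cube", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    Primitive("chair", "cube", (1.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    Primitive("table", "cylinder", (0.5, 1.0, 0.0), (1.0, 1.0, 0.8)),
]
predicted_scene = [
    Primitive("chair", "cube", (0.1, 0.0, 0.0), (0.5, 0.5, 1.0)),
    Primitive("table", "cube", (0.5, 1.1, 0.0), (1.0, 1.0, 0.8)),
    Primitive("lamp", "sphere", (2.0, 2.0, 1.5), (0.2, 0.2, 0.2)),
]

score = detection_f1(predicted_scene, reference_scene)
print(f"detection F1 = {score:.3f}")
```

In this toy example two of the three predicted objects match a reference label, so precision and recall are both 2/3. The same scene written in a different scene-code language would be parsed into the same `Primitive` list before scoring, which is what makes a cross-language comparison possible in principle.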
