Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?
Animesh Maheshwari, Divyansh Sahu, Nishit Verma

TL;DR
This paper introduces a 3,034-sample, human-curated benchmark that tests whether vision-language models understand 3D spatial relationships. Models that name objects reliably still fail on occlusion and reflection tasks.
Contribution
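It presents a human-curated benchmark targeting three components of 3D spatial understanding (depth-ordered occlusion, optical-geometry inference over visible reflections, and volumetric rearrangement planning), together with a white-box analysis of where models fail.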
Findings
Models plan volumetric rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints.
The same models fall to 6-45% accuracy on occlusion and below 7% on reflections.
White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger.
Abstract
Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints fall to 6-45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable…
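The response count in the abstract follows from the benchmark size: 3,034 samples answered by six models gives 18,204 scored responses. The sketch below checks that arithmetic and shows one plausible way per-task accuracy could be tallied from annotator judgements. The record format `(model, task, is_correct)`, the function name `accuracy_by_task`, and the toy data are assumptions made for illustration; they are not the authors' scoring pipeline or results.

```python
from collections import defaultdict

N_SAMPLES = 3034  # benchmark size stated in the abstract
N_MODELS = 6      # number of VLMs evaluated
assert N_SAMPLES * N_MODELS == 18204  # matches the reported response count


def accuracy_by_task(judgements):
    """Tally accuracy per (model, task) pair.

    `judgements` is an iterable of (model, task, is_correct) tuples.
    This record format is hypothetical, not taken from the paper.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for model, task, is_correct in judgements:
        totals[(model, task)] += 1
        correct[(model, task)] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Toy judgements invented for illustration only; not data from the paper.
toy = [
    ("model_a", "rearrangement", True),
    ("model_a", "rearrangement", True),
    ("model_a", "occlusion", False),
    ("model_a", "occlusion", True),
    ("model_a", "reflection", False),
]
scores = accuracy_by_task(toy)
```

Keeping the tally keyed by both model and task makes the kind of dissociation the abstract reports (high on rearrangement, low on occlusion and reflections) visible per model rather than averaged away.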
