Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

Animesh Maheshwari; Divyansh Sahu; Nishit Verma

arXiv:2605.20448·cs.CV·May 21, 2026

TL;DR

This paper introduces a 3,034-sample human-curated benchmark for evaluating how well vision-language models understand 3D spatial relationships. It finds that models which reliably name objects still fail on occlusion and reflection reasoning.

Contribution

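It presents a human-curated benchmark targeting three components of 3D spatial understanding in vision-language models (depth-ordered occlusion, optical-geometry inference over visible reflections, and volumetric rearrangement planning), together with a white-box analysis of where the models fail.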

Findings

01

Models plan volumetric rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints.

02

The same models fall to 6-45% accuracy on occlusion tasks and below 7% on reflection tasks.

03

White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger.

Abstract

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints fall to 6-45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable…
