Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal

TL;DR
This paper introduces COSMIC, a benchmark that evaluates whether multimodal language models can build a shared spatial understanding of a 3D environment through dialogue. Models identify shared anchor objects fairly reliably but struggle with relational reasoning and globally consistent map building.
Contribution
The paper presents COSMIC, a new benchmark for collaborative spatial communication, and systematically evaluates MLLMs' capabilities in forming shared spatial mental models.
Findings
MLLMs reliably identify shared anchor objects across views.
MLLMs perform worse on relational reasoning and largely fail at building globally consistent maps, performing near chance.
Human dialogues achieve 95% accuracy, while the best models reach 72%, indicating a significant gap.
Abstract
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance, even for frontier…
