Visuospatial Perspective Taking in Multimodal Language Models
Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke

TL;DR
This paper evaluates multimodal language models' ability to perform visuospatial perspective-taking, revealing significant limitations especially in complex Level 2 tasks that require adopting others' viewpoints.
Contribution
The paper adapts two perspective-taking tasks from human studies, the Director Task and the Rotating Figure Task, to assess MLMs' visuospatial reasoning, highlighting deficits in inhibiting their own perspective and adopting another's.
Findings
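MLMs show pronounced deficits on Level 2 VPT tasks, which require inhibiting one's own perspective to adopt another's.
Current MLMs have limited ability to represent and reason about perspectives other than their own.
These limitations have implications for the use of MLMs in collaborative settings.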
Abstract
As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
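To make the Level 2 requirement concrete, the sketch below illustrates the geometry behind a left/right judgment made from a rotated figure's viewpoint rather than the observer's own. It is an illustration only, not the paper's stimuli or scoring: the function names, the coordinate convention (x to the observer's right, y pointing away from the observer, angle measured clockwise from the observer's own heading), and the tolerance are all assumptions introduced here.

```python
import math


def egocentric_side(dx: float, tol: float = 1e-9) -> str:
    """Side of an object from the observer's own viewpoint (hypothetical helper).

    dx is the object's lateral offset in the observer's frame (positive = right).
    """
    if abs(dx) <= tol:
        return "center"
    return "right" if dx > 0 else "left"


def altercentric_side(dx: float, dy: float, theta_deg: float, tol: float = 1e-9) -> str:
    """Side of an object from the viewpoint of a figure rotated by theta_deg.

    Assumed convention: (dx, dy) is the object's offset from the figure in the
    observer's frame; theta_deg = 0 means the figure faces the same way as the
    observer, and 180 means the figure faces the observer.
    """
    theta = math.radians(theta_deg)
    # The figure's "right" axis expressed in the observer's frame.
    right_x, right_y = math.cos(theta), -math.sin(theta)
    # Project the object's offset onto the figure's right axis.
    lateral = dx * right_x + dy * right_y
    return egocentric_side(lateral, tol)


if __name__ == "__main__":
    # An object ahead of and to the right of the observer, at several disparities.
    for angle in (0, 90, 180, 270):
        print(angle, egocentric_side(1.0), altercentric_side(1.0, 1.0, angle))
```

Under these assumptions the egocentric and altercentric answers agree at 0° and diverge as the angular disparity grows; for an object directly to the observer's right, they are opposite at 180°. Answering from the figure's viewpoint therefore requires setting aside the egocentric answer, which is the kind of perspective inhibition the abstract describes.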
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Action Observation and Synchronization
