Visuospatial Perspective Taking in Multimodal Language Models

Jonathan Prunty; Seraphina Zhang; Patrick Quinn; Jianxun Lian; Xing Xie; Lucy Cheke

arXiv:2603.23510 · cs.CL · March 26, 2026

Open Access

TL;DR

This paper evaluates multimodal language models' visuospatial perspective-taking and finds significant limitations, especially in Level 2 tasks, which require inhibiting one's own viewpoint to adopt another's.

Contribution

It adapts two tasks from human studies, the Director Task and the Rotating Figure Task, to assess MLMs' visuospatial perspective-taking, highlighting deficits in inhibiting their own perspective and adopting another's.

Findings

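01

MLMs show pronounced deficits in Level 2 VPT across both the Director Task and the Rotating Figure Task.

02

Current MLMs are critically limited in representing and reasoning about alternative perspectives.

03

These limitations have implications for the use of MLMs in collaborative contexts.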

Abstract

As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Action Observation and Synchronization