Using Perspectival Words Is Harder Than Vocabulary Words for Humans and Even More So for Multimodal Language Models

Dota Tianai Dong; Yifan Luo; Po-Ya Angela Wang; Asli Ozyurek; Paula Rubio-Fernandez

arXiv:2506.00065·cs.CL·April 21, 2026

TL;DR

This study compares how humans and multimodal language models (MLMs) use perspectival words. Models struggle with these words more than humans do, especially with demonstratives, and the gap is attributed to limitations in perspective-taking and spatial reasoning.

Contribution

The paper shows that MLMs have greater difficulty with perspectival words than with vocabulary words, pointing to a shortfall in pragmatic and social-cognitive skills.

Findings

01 MLMs perform worse than humans on perspectival words, especially demonstratives.

02 Instruction-based prompting reduces the gap for possessives but not for demonstratives.

03 Limitations in perspective-taking and spatial reasoning are key sources of the gaps.

Abstract

Multimodal language models (MLMs) increasingly demonstrate human-like communication, yet their use of everyday perspectival words remains poorly understood. To address this gap, we compare humans and MLMs in their use of three word types that impose increasing cognitive demands: vocabulary (for example, "boat" or "cup"), possessives (for example, "mine" versus "yours"), and demonstratives (for example, "this one" versus "that one"). Testing seven MLMs against human participants, we find that perspectival words are harder than vocabulary words for both groups. The gap is larger for MLMs: while models approach human-level performance on vocabulary, they show clear deficits with possessives and even greater difficulty with demonstratives. Ablation analyses indicate that limitations in perspective-taking and spatial reasoning are key sources of these gaps. Instruction-based prompting…
