Using Perspectival Words Is Harder Than Vocabulary Words for Humans and Even More So for Multimodal Language Models
Dota Tianai Dong, Yifan Luo, Po-Ya Angela Wang, Asli Ozyurek, Paula Rubio-Fernandez

TL;DR
This study compares how humans and multimodal language models (MLMs) use perspectival words. Models struggle with these words more than humans do, especially with demonstratives, and the gaps are attributed to limitations in perspective-taking and spatial reasoning.
Contribution
The paper shows that MLMs have greater difficulty with perspectival words than with vocabulary words, pointing to a shortfall in pragmatic and social-cognitive skills.
Findings
MLMs perform worse than humans on perspectival words, especially demonstratives.
Instruction-based prompting reduces the gap for possessives but not for demonstratives.
Limitations in perspective-taking and spatial reasoning are key sources of the gaps.
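The comparison behind these findings can be illustrated with a minimal sketch that computes per-word-type accuracy for humans and for a model, then takes the difference. This is not the authors' analysis code. The function names, the trial format `(word_type, correct)`, and the toy trials are all assumptions made for illustration, and the toy values are not results from the paper.

```python
from collections import defaultdict

# Word types named in the summary, ordered by increasing cognitive demand.
WORD_TYPES = ("vocabulary", "possessive", "demonstrative")


def accuracy_by_type(trials):
    """Return accuracy per word type from (word_type, correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for word_type, correct in trials:
        totals[word_type] += 1
        hits[word_type] += int(correct)
    return {wt: hits[wt] / totals[wt] for wt in totals}


def human_model_gap(human_trials, model_trials):
    """Return human accuracy minus model accuracy for each word type."""
    human = accuracy_by_type(human_trials)
    model = accuracy_by_type(model_trials)
    return {
        wt: human[wt] - model[wt]
        for wt in WORD_TYPES
        if wt in human and wt in model
    }


# Hypothetical toy trials, NOT data from the paper.
human_trials = [
    ("vocabulary", True), ("vocabulary", True),
    ("possessive", True), ("possessive", True),
    ("demonstrative", True), ("demonstrative", False),
]
model_trials = [
    ("vocabulary", True), ("vocabulary", True),
    ("possessive", True), ("possessive", False),
    ("demonstrative", False), ("demonstrative", False),
]

gaps = human_model_gap(human_trials, model_trials)
for word_type in WORD_TYPES:
    print(f"{word_type}: gap = {gaps[word_type]:.2f}")
```

A positive gap means humans outperform the model on that word type. Computing the gap separately for each prompting condition would show whether instruction-based prompting narrows it.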
Abstract
Multimodal language models (MLMs) increasingly demonstrate human-like communication, yet their use of everyday perspectival words remains poorly understood. To address this gap, we compare humans and MLMs in their use of three word types that impose increasing cognitive demands: vocabulary (for example, "boat" or "cup"), possessives (for example, "mine" versus "yours"), and demonstratives (for example, "this one" versus "that one"). Testing seven MLMs against human participants, we find that perspectival words are harder than vocabulary words for both groups. The gap is larger for MLMs: while models approach human-level performance on vocabulary, they show clear deficits with possessives and even greater difficulty with demonstratives. Ablation analyses indicate that limitations in perspective-taking and spatial reasoning are key sources of these gaps. Instruction-based prompting…
