Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models
Gracjan Góral, Alicja Ziarko, Michal Nauman, Maciej Wołczyk

TL;DR
This paper evaluates whether vision language models (VLMs) can perform visual perspective-taking (VPT). It introduces two new datasets, finds a significant performance drop when perspective-taking is required, and argues that better benchmarks are needed.
Contribution
The paper introduces two manually curated datasets, Isle-Bricks and Isle-Dots, for testing VPT in vision language models, and uses them to evaluate the perspective-taking capabilities of 12 commonly used models.
Findings
Across all evaluated models, performance drops significantly when perspective-taking is required.
Performance on object detection tasks is poorly correlated with VPT ability.
Existing benchmarks may be insufficient for assessing VPT in VLMs.
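The correlation finding above can be made concrete with a rank-correlation check. The following is only a minimal sketch of how such a comparison might be run, not the authors' analysis: the function names are illustrative, Spearman's coefficient is an assumed choice of statistic, and the per-model scores are made-up placeholders rather than results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# HYPOTHETICAL placeholder accuracies, one entry per model.
# These numbers are NOT taken from the paper.
detection_accuracy = [0.95, 0.90, 0.88, 0.97, 0.85]
vpt_accuracy = [0.40, 0.55, 0.30, 0.35, 0.50]

rho = spearman(detection_accuracy, vpt_accuracy)
print(f"Spearman rho (placeholder data): {rho:.2f}")
```

A coefficient near zero would indicate that a model's rank on object detection says little about its rank on perspective-taking. With only a dozen models, any such estimate would be noisy and should be read cautiously.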
Abstract
Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such a capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots, for testing VPT skills, and we use them to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find that performance in object detection tasks is poorly correlated with performance…
Taxonomy
Topics: Categorization, perception, and language · Language, Metaphor, and Cognition
