Egocentric Bias in Vision-Language Models
Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo

TL;DR
This paper introduces FlipSet, a benchmark for assessing perspective-taking in vision-language models. It reveals systematic egocentric bias and shows that models struggle to combine social awareness with spatial reasoning.
Contribution
The paper presents FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. It exposes systematic egocentric bias and indicates that current models lack the mechanisms needed to bind social awareness to spatial operations.
Findings
The vast majority of the 103 evaluated VLMs perform below chance on the perspective-taking task.
Roughly three-quarters of errors reproduce the camera viewpoint, the signature of egocentric bias.
Models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail when the two must be integrated.
Abstract
Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations,…
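The task described in the abstract can be illustrated with a minimal Python sketch. It assumes that the other agent's view of a flat string equals a 180-degree in-plane rotation of what the camera sees. The character map `ROT180`, the helper names, and the toy trials are all hypothetical and are not taken from FlipSet; the benchmark's actual stimuli and scoring may differ.

```python
# Hypothetical map: each key looks like its value after a 180-degree rotation.
# This is an illustrative subset, not the character set used by FlipSet.
ROT180 = {
    "b": "q", "q": "b",
    "d": "p", "p": "d",
    "n": "u", "u": "n",
    "6": "9", "9": "6",
    "o": "o", "s": "s", "x": "x", "z": "z",
}


def rotate_180(text: str) -> str:
    """Return how `text` reads after a 180-degree in-plane rotation.

    Rotation reverses the character order and swaps each character
    for its rotated counterpart.
    """
    return "".join(ROT180[ch] for ch in reversed(text))


def classify(stimulus: str, answer: str) -> str:
    """Label a model answer as 'correct', 'egocentric', or 'other'.

    An egocentric answer repeats the string as seen from the camera.
    Rotation-symmetric stimuli (where both views coincide) are counted
    as correct, so they cannot separate the two cases.
    """
    if answer == rotate_180(stimulus):
        return "correct"
    if answer == stimulus:
        return "egocentric"
    return "other"


def egocentric_error_share(trials) -> float:
    """Fraction of errors that are egocentric; 0.0 when there are no errors."""
    labels = [classify(stimulus, answer) for stimulus, answer in trials]
    errors = [label for label in labels if label != "correct"]
    if not errors:
        return 0.0
    return errors.count("egocentric") / len(errors)


# Made-up (stimulus, answer) pairs, for illustration only.
TOY_TRIALS = [
    ("bud", "pnq"),  # correct: the view from the opposite side
    ("bud", "bud"),  # egocentric: camera view repeated
    ("dun", "dun"),  # egocentric
    ("sob", "sob"),  # egocentric
    ("dun", "xyz"),  # other error
]
```

Calling `egocentric_error_share(TOY_TRIALS)` gives the share of wrong answers that simply repeat the camera view, which is the kind of quantity the abstract reports when it says that roughly three-quarters of errors reproduce the camera viewpoint.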
Taxonomy
Topics: Spatial Cognition and Navigation · Action Observation and Synchronization · Categorization, perception, and language
