Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
Bridget Leonard, Scott O. Murray

TL;DR
This paper introduces perspective tokens inspired by human spatial cognition to improve multimodal models' ability to adopt other agents' visual perspectives, enhancing spatial reasoning and reducing egocentric bias.
Contribution
The paper proposes a novel embedding method called perspective tokens that encode orientation, enabling models to perform better in perspective-taking tasks and generalize across different agents.
Findings
Perspective tokens improve accuracy on visual perspective-taking benchmarks.
Rotation-based tokens generalize to non-human reference agents.
Fine-tuning enhances latent orientation sensitivity in models.
Abstract
Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B improves performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation…
Taxonomy
Topics: Spatial Cognition and Navigation · Categorization, perception, and language · Multimodal Machine Learning Applications
