Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

Bridget Leonard; Scott O. Murray

arXiv:2601.16378·cs.CV·January 26, 2026

Open Access

TL;DR

This paper introduces perspective tokens inspired by human spatial cognition to improve multimodal models' ability to adopt other agents' visual perspectives, enhancing spatial reasoning and reducing egocentric bias.

Contribution

The paper proposes perspective tokens, a novel class of embeddings that encode orientation, enabling models to perform better on perspective-taking tasks and to generalize across different reference agents.
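To make the idea of a token that encodes orientation concrete, here is a minimal sketch. It is not the paper's method: the source says only that the tokens encode orientation through body-keypoint cues or abstract representations supporting mental rotation. The binning scheme, the bin count of 8, the token names (`<ori_0>` ... `<ori_7>`), and the helper functions `orientation_to_token` and `extend_vocab` are all hypothetical choices made for illustration.

```python
# Illustrative only: discretize an agent's heading into a small set of
# special tokens and register them in a toy vocabulary. The names and the
# 8-bin scheme are assumptions, not taken from the paper.

def orientation_to_token(angle_deg: float, n_bins: int = 8) -> str:
    """Map a heading angle in degrees to a discrete orientation token."""
    width = 360.0 / n_bins
    # Shift by half a bin so each bin is centred on its nominal direction,
    # then wrap so angles just below 360 fall back into bin 0.
    index = int(((angle_deg % 360.0) + width / 2.0) // width) % n_bins
    return f"<ori_{index}>"


def extend_vocab(vocab: dict, n_bins: int = 8) -> dict:
    """Return a copy of vocab with one new id per orientation token."""
    extended = dict(vocab)
    for i in range(n_bins):
        # New tokens receive the next free id; existing entries are kept.
        extended.setdefault(f"<ori_{i}>", len(extended))
    return extended


if __name__ == "__main__":
    vocab = extend_vocab({"hello": 0, "world": 1})
    token = orientation_to_token(90.0)
    print(token, vocab[token])
```

In a real multimodal model, each new vocabulary id would also need a trainable embedding row, and the orientation itself would have to be estimated from the image. Both steps are outside the scope of this sketch.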

Findings

01. Perspective tokens improve accuracy on visual perspective-taking benchmarks.

02. Rotation-based tokens generalize to non-human reference agents.

03. Fine-tuning enhances latent orientation sensitivity in models.

Abstract

Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation…

Taxonomy

Topics: Spatial Cognition and Navigation · Categorization, perception, and language · Multimodal Machine Learning Applications