Failures in Perspective-taking of Multimodal AI Systems
Bridget Leonard, Kristin Woodard, and Scott O. Murray

TL;DR
This paper examines the limits of perspective-taking in multimodal AI systems. It uses techniques from cognitive and developmental science to assess GPT-4o, contrasting the model's propositional spatial representations with the analog representations used in human spatial cognition.
Contribution
It applies methods from cognitive and developmental science to evaluate perspective-taking in GPT-4o, highlighting how the model's spatial representations differ from those of human spatial cognition.
Findings
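Current multimodal models rely on propositional spatial representations, unlike the analog representations used in human and animal spatial cognition.
GPT-4o shows limited perspective-taking abilities.
Comparing the model with human cognitive development offers guidance for future research and model development.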
Abstract
This study extends previous research on spatial representations in multimodal AI systems. Although current models demonstrate a rich understanding of spatial information from images, this information is rooted in propositional representations, which differ from the analog representations employed in human and animal spatial cognition. To further explore these limitations, we apply techniques from cognitive and developmental science to assess the perspective-taking abilities of GPT-4o. Our analysis enables a comparison between the cognitive development of the human brain and that of multimodal AI, offering guidance for future research and model development.
Taxonomy
Topics: Speech and dialogue systems · Geographic Information Systems Studies
