Failures in Perspective-taking of Multimodal AI Systems

Bridget Leonard, Kristin Woodard, and Scott O. Murray

arXiv:2409.13929 · cs.AI · September 24, 2024

TL;DR

This paper investigates the perspective-taking limitations of multimodal AI systems, contrasting their propositional spatial representations with the analog representations of human spatial cognition and using techniques from cognitive science to assess GPT-4o.

Contribution

It introduces a novel approach to evaluate AI perspective-taking by applying cognitive science methods, highlighting differences from human spatial cognition.

Findings

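01. Current models rely on propositional spatial representations, which differ from the analog representations used in human and animal spatial cognition.

02. GPT-4o shows limited perspective-taking abilities.

03. Comparing GPT-4o with human cognitive development offers guidance for future research and model development.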

Abstract

This study extends previous research on spatial representations in multimodal AI systems. Although current models demonstrate a rich understanding of spatial information from images, this information is rooted in propositional representations, which differ from the analog representations employed in human and animal spatial cognition. To further explore these limitations, we apply techniques from cognitive and developmental science to assess the perspective-taking abilities of GPT-4o. Our analysis enables a comparison between the cognitive development of the human brain and that of multimodal AI, offering guidance for future research and model development.

Code & Models

Repositories

bridgetleonard2/PerspectiveTaking (official)

Taxonomy

Topics: Speech and dialogue systems · Geographic Information Systems Studies