Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Gracjan Góral; Alicja Ziarko; Piotr Miłoś; Michał Nauman; Maciej Wołczyk; Michał Kosiński

arXiv:2505.03821·cs.CV·March 31, 2026

TL;DR

This paper assesses Vision Language Models' ability to perform visual perspective taking through controlled spatial tasks, revealing strengths in scene understanding but significant challenges in spatial reasoning and perspective taking.

Contribution

It introduces a new set of visual tasks inspired by human tests to evaluate VLMs' spatial and perspective reasoning capabilities.

Findings

01 Models excel at scene understanding but struggle with spatial reasoning.

02 Performance drops significantly on perspective taking tasks.

03 The results highlight the need for explicit geometric reasoning in VLMs.

Abstract

We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. We evaluate several high-performing models, including Gemini Robotics-ER 1.5, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, GPT-4, and Qwen3, and find that while they excel at scene…
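
As a rough illustration of the evaluation design described in the abstract, the sketch below counts the question instances implied by 144 tasks with 7 questions each and aggregates correctness per cognition level. The level names, the `(level, is_correct)` record layout, and the toy records are assumptions made for illustration only; they are not the authors' data format or results, and the abstract does not say how the 7 questions are split across the three levels.

```python
from collections import defaultdict

N_TASKS = 144            # stated in the abstract
QUESTIONS_PER_TASK = 7   # stated in the abstract


def total_questions(n_tasks=N_TASKS, per_task=QUESTIONS_PER_TASK):
    """Number of (task, question) pairs a model would answer."""
    return n_tasks * per_task


def accuracy_by_level(records):
    """Return per-level accuracy from (level, is_correct) pairs.

    The record layout is hypothetical, not the dataset's schema.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for level, is_correct in records:
        counts[level] += 1
        hits[level] += int(bool(is_correct))
    return {level: hits[level] / counts[level] for level in counts}


# Invented toy records, only to exercise the function (not paper results).
toy_records = [
    ("scene_understanding", True),
    ("scene_understanding", True),
    ("spatial_reasoning", True),
    ("spatial_reasoning", False),
    ("perspective_taking", False),
    ("perspective_taking", False),
]

if __name__ == "__main__":
    print(total_questions())
    print(accuracy_by_level(toy_records))
```

Grouping by level rather than by individual question keeps the summary aligned with the three levels named in the abstract: scene understanding, spatial reasoning, and visual perspective taking.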

Code & Models

Datasets

Gracjan/Isle
dataset · 396 dl