Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Yuangong Chen; Wai Keung Wong; Jiaxing Li; Ioannis Patras; Xu Zheng

arXiv:2605.12413 · cs.CV · May 19, 2026

TL;DR

This paper introduces PCSR-Bench, a diagnostic benchmark for evaluating the spatial reasoning of Multimodal Large Language Models (MLLMs) on omnidirectional images. It reveals a large gap between basic perception tasks and advanced perspective-conditioned reasoning, and shows that RL-based fine-tuning can substantially improve a 7B model.

Contribution

The paper presents PCSR-Bench, a large-scale benchmark for perspective-conditioned spatial reasoning, and uses RL-based fine-tuning to test how far the spatial reasoning of MLLMs can be improved.

Findings

01. MLLMs achieve 57.59% accuracy on basic relative direction tasks.
02. Accuracy drops sharply to 0.64% on open-ended compositional reasoning.
03. Reward shaping improves a 7B model's performance from 31.10% to 60.06%.

Abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial…
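To make the idea of viewpoint-dependent inference concrete, here is a minimal Python sketch of how the answer to a relative direction question can change under egocentric rotation. It is an illustration only, not the benchmark's construction pipeline: the function names, the four-way direction labels, the 90-degree sectors, and the convention that azimuth increases to the viewer's right with 0 degrees at the image center are all assumptions made for this example.

```python
def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees to the half-open range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def column_to_azimuth(col: int, width: int) -> float:
    """Map a pixel column of an equirectangular panorama to an azimuth.

    Assumed convention (hypothetical): the left edge is -180 degrees,
    the image center is 0 degrees, and azimuth grows to the right.
    """
    return (col + 0.5) / width * 360.0 - 180.0


def relative_direction(object_azimuth: float, facing: float) -> str:
    """Coarse direction of an object for a viewer facing `facing` degrees.

    The four labels and the 90-degree sectors are illustrative choices.
    """
    rel = wrap_deg(object_azimuth - facing)
    if -45.0 <= rel < 45.0:
        return "front"
    if 45.0 <= rel < 135.0:
        return "right"
    if -135.0 <= rel < -45.0:
        return "left"
    return "back"


if __name__ == "__main__":
    lamp = 90.0  # hypothetical object, 90 degrees to the right of image center
    for facing in (0.0, 90.0, 180.0):
        print(f"facing {facing:>5.1f} deg -> lamp is {relative_direction(lamp, facing)}")
```

The same object is labeled differently as the assumed facing direction changes, even though the panorama itself is unchanged, which is the kind of dependence on viewpoint the abstract describes.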
