Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
Yuangong Chen, Wai Keung Wong, Jiaxing Li, Ioannis Patras, Xu Zheng

TL;DR
This paper introduces PCSR-Bench, a diagnostic benchmark for evaluating the perspective-conditioned spatial reasoning of Multimodal Large Language Models (MLLMs) on omnidirectional images. It reveals a large gap between basic perception and compositional reasoning, and room for improvement through fine-tuning.
Contribution
The paper presents PCSR-Bench, a large-scale benchmark for perspective-conditioned spatial reasoning with 84,373 question-answer pairs, and investigates the plasticity of MLLMs through RL-based fine-tuning.
Findings
MLLMs achieve 57.59% accuracy on basic relative direction tasks.
Accuracy drops sharply to 0.64% on open-ended compositional reasoning.
Reward shaping improves a 7B model's performance from 31.10% to 60.06%.
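To put the reward-shaping result in perspective, the short sketch below works out the absolute and relative gain implied by the two reported accuracies. The helper name `accuracy_gain` is hypothetical and only illustrates the arithmetic; it is not taken from the paper.

```python
def accuracy_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100.0
    return absolute, relative


# Accuracies reported for the 7B model before and after reward shaping.
abs_gain, rel_gain = accuracy_gain(31.10, 60.06)

print(f"absolute gain: {abs_gain:.2f} percentage points")  # 28.96
print(f"relative gain: {rel_gain:.1f}%")                   # 93.1%
```

In other words, the reported accuracy nearly doubles, although the summary does not say which tasks or evaluation split these two figures cover.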
Abstract
Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial…
