Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu

TL;DR
This paper introduces OSR-Bench, a comprehensive benchmark for evaluating multimodal large language models on omnidirectional spatial reasoning tasks using panoramic indoor scenes, revealing current models' limitations.
Contribution
The paper presents OSR-Bench, the first benchmark for omnidirectional spatial reasoning in MLLMs, along with a novel negative sampling strategy and a two-stage evaluation framework.
Findings
Current MLLMs perform poorly on panoramic spatial reasoning tasks.
OSR-Bench reveals significant gaps in models' perceptual grounding abilities.
Evaluation of eight state-of-the-art models highlights the need for more perceptually grounded MLLMs.
Abstract
The 180°×360° omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination…
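The negative sampling idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_counting_questions`, `hallucination_rate`), the question template, and the data layout are all hypothetical, and the sketch assumes that a counting question about an absent object has the correct answer 0.

```python
import random


def build_counting_questions(scene_objects, vocabulary, num_negatives=2, seed=0):
    """Build counting questions for one scene (hypothetical helper).

    scene_objects: dict mapping object name -> count actually present.
    vocabulary: list of candidate object names, some absent from the scene.
    """
    rng = random.Random(seed)
    # Objects in the vocabulary that do not appear in this scene.
    absent = sorted(set(vocabulary) - set(scene_objects))
    negatives = rng.sample(absent, min(num_negatives, len(absent)))

    questions = []
    for name, count in scene_objects.items():
        questions.append({
            "object": name,
            "question": f"How many instances of '{name}' are in the room?",
            "answer": count,
            "is_negative": False,
        })
    for name in negatives:
        # Negative sample: the object does not exist, so the answer is 0.
        questions.append({
            "object": name,
            "question": f"How many instances of '{name}' are in the room?",
            "answer": 0,
            "is_negative": True,
        })
    return questions


def hallucination_rate(questions, predictions):
    """Fraction of negative questions where the model reported a nonzero count."""
    negative_preds = [p for q, p in zip(questions, predictions) if q["is_negative"]]
    if not negative_preds:
        return 0.0
    return sum(1 for p in negative_preds if p != 0) / len(negative_preds)
```

A model that answers with a nonzero count for an object that was never in the scene is counted as hallucinating under this assumed metric, which may differ from the scoring the paper actually uses.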
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
