Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

Zihao Dongfang; Xu Zheng; Ziqiao Weng; Yuanhuiyi Lyu; Danda Pani Paudel; Luc Van Gool; Kailun Yang; Xuming Hu

arXiv:2505.11907·cs.CV·May 20, 2025

TL;DR

This paper introduces OSR-Bench, a comprehensive benchmark for evaluating multimodal large language models on omnidirectional spatial reasoning tasks using panoramic indoor scenes, revealing current models' limitations.

Contribution

The paper presents OSR-Bench, the first benchmark for omnidirectional spatial reasoning in MLLMs, along with a novel negative sampling strategy and a two-stage evaluation framework.

Findings

01. Current MLLMs perform poorly on panoramic spatial reasoning tasks.

02. OSR-Bench reveals significant gaps in models' perceptual grounding abilities.

03. Evaluation of eight state-of-the-art models highlights the need for more perceptually grounded MLLMs.

Abstract

The 180°×360° omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination…
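The negative sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_negative_question`, the category vocabulary, and the question template are all hypothetical, and the sketch assumes only that a negative sample asks about an object category absent from the scene, so that the grounded answer to a counting question is zero.

```python
import random


def make_negative_question(scene_objects, vocabulary, rng):
    """Build one counting question about an object that is NOT in the scene.

    Hypothetical helper for illustration; the prompt wording and the
    choice of a counting question are assumptions, not the paper's format.
    """
    present = set(scene_objects)
    # Candidate negatives: vocabulary categories that never occur in this scene.
    absent = sorted(set(vocabulary) - present)
    if not absent:
        raise ValueError("every vocabulary category is present in the scene")
    target = rng.choice(absent)
    question = f"How many objects of category '{target}' are in the room?"
    # A grounded model should answer 0; any positive count is a hallucination.
    return target, question, 0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    scene = ["sofa", "table", "lamp"]
    vocab = ["sofa", "table", "lamp", "piano", "bathtub"]
    target, question, expected = make_negative_question(scene, vocab, rng)
    print(question, "expected:", expected)
```

Scoring a model on such questions separately from ordinary ones would give a rough hallucination rate, under the assumption that the scene annotation is complete.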

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
