Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

TL;DR
This paper introduces MV-RoboBench, a benchmark for evaluating multi-view spatial reasoning in vision-language models within robotic scenes, revealing current models' limitations and the importance of multi-view understanding for robotic tasks.
Contribution
The paper presents MV-RoboBench, a new benchmark for multi-view spatial reasoning in robotic contexts, and evaluates existing models, highlighting their shortcomings and the gap to human performance.
Findings
State-of-the-art models perform far below humans in multi-view robotic reasoning.
Spatial intelligence correlates with robotic task success in multi-view scenarios.
Single-view benchmarks do not predict multi-view robotic reasoning performance.
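The last two findings are claims about correlation across models. As a minimal sketch of how such a check could be run, the snippet below computes a Pearson correlation between two lists of per-model scores. The choice of Pearson is an assumption: this summary does not say which statistic the authors use. The function name `pearson` and all score values are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-model scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Invented placeholder accuracies, one entry per model (NOT from the paper).
spatial_scores = [0.30, 0.45, 0.50, 0.65]
robotic_scores = [0.28, 0.40, 0.55, 0.60]

r = pearson(spatial_scores, robotic_scores)
print(f"correlation between spatial and robotic scores: {r:.2f}")
```

A value of r near 1 would match the second finding. A value near 0 between single-view and multi-view scores would match the third.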
Abstract
Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided…
Peer Reviews
Decision: ICLR 2026 Poster
+ The paper is clearly written and easy to follow. The topic (benchmarking of AI models, embodied AI, spatial AI) is relevant to a broad spectrum of ICLR community members.
+ The proposed benchmark is constructed and curated rigorously: human raters, rather than automatic pipelines, perform most annotation tasks. Compared to related benchmarks, this alleviates the difficulty of data curation and potentially helps achieve higher dataset quality.
+ I appreciate the authors for arrangin
My primary concern lies with the validity of the "multi-view" setting in the current dataset format and its relation to embodied AI tasks.
- It seems that many of the designed tasks do not necessitate multi-view input. For the tasks illustrated in Figure 1, most seem doable with only one view, except for viewpoint identification and cross-view matching. If this is the case, I would like to see an ablation on using only one view (say the center view) as input and see if there is s
1. The paper is well structured and easy to follow.
2. The authors conducted a large-scale experiment on multiple modern VLMs, which makes the experimental results and conclusions convincing.
3. Spatial reasoning ability is critical for the development of VLMs. This paper provides a good example of how to evaluate VLMs and thus might have broad influence on the community.
1. The abstract should not use an abbreviation the first time "CoT-inspired techniques" appears. Therefore "CoT-inspired techniques" -> "Chain of Thought (CoT)-inspired techniques."
2. Does the image orientation matter? For example, if the head camera view is upside down, can the system still infer correctly? This discussion is necessary to determine the VLMs' generalization in spatial reasoning.
3. The benchmark has multiple-choice questions across eight subtasks, each with exactl
1. The proposed benchmark addresses an important and practical problem in robotics.
2. The annotations for the benchmark are manually collected, which helps ensure their correctness and reliability.
3. This work also investigates the effects of CoT prompting and uncovers two key correlations, between spatial reasoning and robotic execution, and between single- and multi-view understanding, offering interesting insights to the research community.
4. The presentation is easy to follow.
1. The paper claims to be the first to integrate spatial and robotic reasoning with synchronized multi-view inputs in robotic manipulation scenarios. However, the previous ERQA benchmark also includes some multi-view spatial reasoning and manipulation questions in its test set. Although such samples are fewer and many ERQA items are single-image based, it nevertheless contains similar tasks, such as cross-view matching, as those in the proposed benchmark. Therefore, the authors should discuss ER
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
