MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

TL;DR
MMSI-Bench is a new benchmark designed to evaluate multi-image spatial reasoning in multimodal large language models, revealing significant gaps between current models and human performance and providing insights for future improvements.
Contribution
The paper introduces MMSI-Bench, a comprehensive multi-image spatial reasoning benchmark with detailed questions and an error analysis pipeline, addressing a gap left by existing assessments that focus on single images.
Findings
Current models achieve only around 30-40% accuracy: the strongest open-source model attains roughly 30%, and OpenAI's o3 reasoning model reaches 40%.
Humans score approximately 97% accuracy.
The benchmark reveals substantial room for improvement in multi-image spatial reasoning.
Abstract
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging…
Peer Reviews
Decision: ICLR 2026 Poster
Novel Benchmark Scope: MMSI-Bench uniquely targets multi-image spatial reasoning — a critical yet underexplored capability for MLLMs and embodied AI systems. Prior works (e.g., BLINK, ReMI, MuirBench) only contain limited spatial sub-splits, while this benchmark provides systematic coverage. High-Quality, Human-Curated Data: Each question is manually designed and audited by multiple experts with reasoning explanations, ensuring clarity, difficulty, and lack of ambiguity. The benchmark’s constru
Manual Effort vs. Scalability: Although the manual curation ensures quality, it also limits scalability — future expansions may face bottlenecks unless semi-automatic generation or verification methods are introduced. Metric Simplicity: The benchmark reports only accuracy on multiple-choice tasks. Incorporating richer evaluation metrics (e.g., reasoning correctness or step alignment) could offer more granular insight. Potential Dataset Bias: While data diversity is claimed, the benchmark draws
1. The benchmark's core strength is its manual, expert-driven annotation process. The questions are linguistically diverse, non-trivial, and require spatial understanding. The problem of multi-image spatial reasoning is highly relevant and timely for advancing embodied AI and robotics, and this paper clearly demonstrates a critical capability gap. 2. The paper evaluates an extensive suite of 37 models, providing a valuable and comprehensive snapshot of the entire SOTA. The inclusion of "Human Pe
1. 1,000 samples is a small size, especially when divided across 11 tasks. This limited scale is a direct trade-off for the high-quality manual annotation (300+ hours), but it makes the benchmark difficult to scale and creates a risk of models eventually overfitting to this specific test set. 2. Blind GPT-4o is not a suitable baseline since the questions depend heavily on the images. Language priors are unable to capture the context of the problem unless the images are described in words and the a
Originality: The focus on multi-image spatial reasoning fills a clear gap between single-image VQA and real-world embodied perception. The fully human-curated design adds credibility compared to prior template-based datasets. Quality: The taxonomy of spatial relations (camera, object, region) is systematic, and the annotation process with reasoning traces and multi-reviewer verification shows rigor. The large-scale evaluation across 37 models is comprehensive and carefully controlled. Clarity:
The dataset is still modest in scale (1k QA pairs), which limits generalization analysis. It would help to report variability or cross-split reliability. Many questions rely on human interpretation of viewpoint or direction. Some ambiguity might remain even with expert curation, which could affect reproducibility. The evaluation metric focuses only on answer accuracy; assessing reasoning trace similarity (e.g., using annotated rationales) could reveal finer-grained improvements. While the bench
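The reviews above raise the benchmark's size (1,000 questions across 11 tasks) and ask for reported variability. As a rough illustration that is not taken from the paper, the sketch below computes a normal-approximation 95% interval for an accuracy of 40% on all 1,000 questions and on a hypothetical single task of about 90 questions. It assumes, purely for illustration, that questions are split evenly across tasks and can be treated as independent draws; the function name accuracy_interval is made up for this example.

```python
import math


def accuracy_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy p_hat measured on n questions."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Standard error of a binomial proportion.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip to [0, 1] so the interval stays a valid accuracy range.
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


N_QUESTIONS = 1000                  # benchmark size stated in the abstract
N_TASKS = 11                        # task count mentioned in the reviews
PER_TASK = N_QUESTIONS // N_TASKS   # assumption: an even split across tasks

for label, n in [("full benchmark", N_QUESTIONS), ("single task", PER_TASK)]:
    low, high = accuracy_interval(0.40, n)
    print(f"{label}: n={n}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval is about 3 points wide on either side for the full benchmark and about 10 points on either side for a single task, so per-task gaps of a few points between models may not be meaningful.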
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
