MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR
MMR-Life is a new benchmark of 2,646 multiple-choice questions built on 19,108 real-world images, designed to evaluate multimodal models' diverse reasoning skills; it reveals current models' limitations and can guide future improvements.
Contribution
Introduces MMR-Life, a comprehensive benchmark for assessing multimodal multi-image reasoning across real-life scenarios, covering seven reasoning types without relying on domain-specific expertise.
Findings
Top models like GPT-5 achieve only 58% accuracy.
Performance varies significantly across reasoning types.
Factors like reasoning length and method influence model performance.
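The per-type comparison behind these findings can be illustrated with a minimal sketch of multiple-choice scoring grouped by reasoning type. This is not the authors' evaluation code: the record keys `reasoning_type`, `answer`, and `prediction`, the function name `accuracy_by_type`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict

# The seven reasoning types named in the abstract.
REASONING_TYPES = (
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
)


def accuracy_by_type(records):
    """Return (per-type accuracy dict, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'reasoning_type', 'answer' (gold option), and 'prediction' (model option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        rtype = record["reasoning_type"]
        total[rtype] += 1
        # A question counts as correct only on an exact option match.
        correct[rtype] += int(record["prediction"] == record["answer"])
    per_type = {rtype: correct[rtype] / total[rtype] for rtype in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_type, overall


# Toy records (fabricated for illustration only, not benchmark data).
toy = [
    {"reasoning_type": "causal", "answer": "A", "prediction": "A"},
    {"reasoning_type": "causal", "answer": "B", "prediction": "C"},
    {"reasoning_type": "spatial", "answer": "D", "prediction": "D"},
]

per_type, overall = accuracy_by_type(toy)
for rtype in REASONING_TYPES:
    if rtype in per_type:
        print(f"{rtype}: {per_type[rtype]:.2f}")
print(f"overall: {overall:.2f}")
```

Reporting the per-type numbers next to the overall score is what makes differences between reasoning types visible; a single aggregate accuracy would hide them.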
Abstract
Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead…
Peer Reviews
Decision·ICLR 2026 Poster
1. The benchmark uses 2,676 MCQ questions built from 19,367 real-world images, with many questions requiring information integration across several images, so models cannot rely on single-image shortcuts. 2. Reasoning tasks are organized into seven well-defined types (abductive, analogical, causal, deductive, inductive, spatial, temporal) and 21 tasks, providing a structured view of where current models fail. 3. The analysis across reasoning types shows differentiated difficulty profiles, which
1. The paper analyzes the effect of longer thinking traces only at the overall level and lacks a breakdown by reasoning type, so it is unclear whether “longer thinking ≠ better” holds uniformly across tasks. 2. Human-level performance is based on a subset (validation questions), whereas model results are reported on the full benchmark, which makes the human–model comparison not fully aligned and may overstate the human gap. 3. Error analysis stays relatively shallow and mostly descriptive, co
1. The paper tackles an important and timely problem - evaluating whether MLLMs can effectively handle multi-image, vision-based reasoning tasks, a topic of clear relevance to the ICLR community. 2. The MMR-Life benchmark covers a broad range of reasoning types across diverse real-world image contexts, enabling fine-grained analysis of model strengths and weaknesses and offering a well-structured evaluation framework. 3. The evaluation is thorough and well-rounded, including a diverse set of bot
1. The benchmark largely repurposes existing datasets (Appendix C.1), which limits its novelty. The primary contribution lies in the reorganization and categorization of these datasets by reasoning type rather than in introducing new data or task formulations. 2. The evaluation omits several important baselines, including representative MLLMs (e.g., LLaVA, InstructBLIP) and non-MLLM approaches such as supervised CNNs or few-shot/meta-learning methods (Prototypical Networks, SNAIL, MetaBaseline).
- The paper is well-written and easy to follow, with an abundance of examples. - The benchmark contains interesting challenges and performs a pretty thorough evaluation of SOTA models, showing their deficiencies. - The exploration of reasoning enhancement strategies provides some interesting insights.
- I appreciate the rich appendix section and accompanying analysis. But these examples are most likely cherry-picked, which is totally fine for presentation. It would be good to include a link to an anonymously hosted repository containing the full dataset so reviewers can get a more unbiased, holistic view of data quality. - Related to above: Some of the problems in the appendix are of low quality. - For example, on page 65, the ground truth explanation states that the answer should be "wa
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
