MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR
MMR-V introduces a challenging benchmark for evaluating multimodal deep reasoning in videos, emphasizing long-range, multi-frame inference and reasoning beyond perception, revealing current models' limitations.
Contribution
This paper presents MMR-V, a novel benchmark designed to evaluate deep multimodal reasoning in videos, addressing gaps in existing datasets by focusing on long-range, hidden, and confusable reasoning tasks.
Findings
Even the best-performing model reaches only about 52.5% accuracy on MMR-V.
Reasoning-enhancement strategies such as Chain-of-Thought yield limited gains.
The reasoning required for multimodal tasks differs from that in textual reasoning, which partly explains the limited gains.
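To make the accuracy figure concrete, here is a minimal sketch of how an accuracy score of this kind could be computed. It assumes the tasks are multiple-choice and that a prediction counts as correct when its option letter matches the annotated answer; the function name and toy data are hypothetical and are not taken from the MMR-V evaluation code.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions whose option letter matches the gold answer.

    Assumption: answers are option letters such as "A" or "C"; matching
    ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Toy data for illustration only (not benchmark results).
toy_predictions = ["A", "c", "B", "D"]
toy_answers = ["A", "C", "D", "D"]
print(f"{multiple_choice_accuracy(toy_predictions, toy_answers):.1%}")  # prints 75.0%
```

Under this scoring, the chance baseline depends on the number of options per question, so a raw accuracy is best read alongside that count.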
Abstract
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world…
Peer Reviews
Decision: ICLR 2026 Poster
1. The benchmark is carefully curated with human annotation, with evidenced distractors generated by GPT. 2. The authors evaluated a wide range of models (including GPT-5, Gemini, Claude, and open-source alternatives) and provide detailed error analysis, scaling trends, and modality impact (e.g. audio).
1. The distinction between “implicit” and “explicit” is not always clear-cut in practice. For example, in detective films, identifying the criminal may require both implicit and explicit clues. 2. Comparison with existing video reasoning benchmarks (e.g. VRBench) could be discussed further. [1] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos. ICCV 25. 3. To further understand MLLMs' reasoning ability, it is encouraged to include the comparison based on either dif
* Clarity The paper is well written and well structured, so its clarity is good. * Significance This paper focuses on evaluating the video reasoning capacity of MLLMs, which is an important and practical problem for video understanding. Its significance to the video research community is therefore adequate.
* Reference 1) The recent work [VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, ICCV 2025] addresses a similar topic in video understanding. Please clarify the key differences. 2) Small suggestion: It would be clearer to include a table comparing key statistics of this benchmark with existing ones such as Video-MME, LongVideoBench, LVBench, Video-MMMU, MMVU, etc. * Method Insight 1) It would be more interesting to investigate or indicate how to design ML
1. Dividing the reasoning tasks into implicit and explicit is an interesting and novel step. Evaluating the model’s capability to combine the perceived information with previously learned world knowledge to understand metaphors is important. 2. The entire benchmark is human-labeled, and several measures are taken from video collection to data annotation to guarantee overall quality and make the benchmark more reliable. 3. Experiments are comprehensive; the authors evaluate
1. The questions designed for reasoning are not as deep as the authors claim. First, all question-answer pairs are single-step and do not require reasoning chains. This makes the proposed benchmark less challenging compared with some widely adopted text or image reasoning benchmarks, which include some mathematical or scientific problems that require multi-step reasoning. Besides, for many reasoning question types, such as personal reflection, video naming, and meta-emotion, it is more likely to
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
Methods: ALIGN · Focus
