MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Kejian Zhu; Zhuoran Jin; Hongbang Yuan; Jiachun Li; Shangqing Tu; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao

arXiv:2506.04141 · cs.CV · June 5, 2025

TL;DR

MMR-V introduces a challenging benchmark for evaluating multimodal deep reasoning in videos, emphasizing long-range, multi-frame inference and reasoning beyond perception, revealing current models' limitations.

Contribution

This paper presents MMR-V, a novel benchmark designed to evaluate deep multimodal reasoning in videos, addressing gaps in existing datasets by focusing on long-range, hidden, and confusable reasoning tasks.

Findings

01

Even the best-performing model reaches only 52.5% accuracy on MMR-V.

02

Reasoning strategies like Chain-of-Thought have limited impact.

03

Multi-modal reasoning differs from textual reasoning, affecting performance.
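
The accuracy figure in the findings above can be read as the fraction of questions answered correctly. The following is a minimal sketch of such scoring, not the authors' evaluation code. It assumes a multiple-choice format (suggested by the reviewers' mention of distractors) and a model that ends its response with a line such as "Answer: B". The record fields `category`, `gold`, and `response`, and both function names, are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter from the last 'Answer: X' pattern, or None.

    Assumption: the model was prompted to finish with 'Answer: <letter>'.
    """
    # The negative lookahead rejects words such as 'Answer: Because'.
    found = re.findall(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])", response)
    return found[-1] if found else None


def accuracy_by_category(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Compute per-category and overall accuracy over scored records."""
    if not records:
        return {}
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        # An unparseable response (None) never equals the gold letter,
        # so it is counted as wrong.
        if extract_choice(rec["response"]) == rec["gold"]:
            correct[rec["category"]] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores


# Illustrative records only; these are not taken from the benchmark.
example_records = [
    {"category": "implicit", "gold": "B", "response": "The scene implies loss. Answer: B"},
    {"category": "implicit", "gold": "A", "response": "Answer: (C)"},
    {"category": "explicit", "gold": "D", "response": "I am unsure."},
    {"category": "explicit", "gold": "D", "response": "Answer: D"},
]
example_scores = accuracy_by_category(example_records)
```

Counting unparseable responses as wrong is one possible design choice; a stricter harness might instead report them separately so that formatting failures are not confused with reasoning failures.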

Abstract

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The benchmark is carefully curated with human annotation, with evidenced distractors generated by GPT.
2. The authors evaluated a wide range of models (including GPT-5, Gemini, Claude, and open-source alternatives) and provide detailed error analysis, scaling trends, and modality impact (e.g., audio).

Weaknesses

1. The distinction between “implicit” and “explicit” is not always clear-cut in practice. For example, in detective films, it might require both implicit and explicit clues to determine the criminals.
2. Comparison with existing video reasoning benchmarks (e.g., VRBench) could be further discussed. [1] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos. ICCV 25.
3. To further understand MLLMs' reasoning ability, it is encouraged to include the comparison based on either dif

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* Clarity: The paper is well written and well structured, so its clarity is good.
* Significance: This paper focuses on evaluating the video reasoning capacity of MLLMs, which is an important and practical problem for video understanding. Hence, the significance is acceptable for the video research community.

Weaknesses

* Reference
1) The recent work [VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, ICCV 2025] addresses a similar topic in video understanding. Please clarify the key difference.
2) Small suggestion: It would be clearer to include a table showing the differences in key statistics between this benchmark and existing ones such as Video-MME, LongVideoBench, LVBench, Video-MMMU, MMVU, etc.
* Method Insight
1) It would be more interesting to investigate or indicate how to design ML

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. It is an interesting and novel step to divide the reasoning tasks into implicit and explicit. Evaluating the model’s capability to combine the perceived information with previously learned world knowledge to understand the metaphors is important.
2. The entire benchmark is human-labeled, and several measures are taken from video collection to data annotation to guarantee the overall quality and make the benchmark more reliable.
3. Experiments are comprehensive; the authors evaluate

Weaknesses

1. The questions designed for reasoning are not as deep as the authors claim. First, all question-answer pairs are single-step and do not require reasoning chains. This makes the proposed benchmark less challenging compared with some widely adopted text or image reasoning benchmarks, which include some mathematical or scientific problems that require multi-step reasoning. Besides, for many reasoning question types, such as personal reflection, video naming, and meta-emotion, it is more likely to

Code & Models

Datasets

JokerJan/MMR-VBench
dataset · 499 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling

Methods: ALIGN · Focus