Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu; Haozhi Yuan; Yuhao Dong; Yi-Fan Zhang; Yunhang Shen; Xiaoxing Hu; Xueying Li; Jinsen Su; Chengwu Long; Xiaoyao Xie; Yongkang Xie; Xiawu Zheng; Xue Yang; Haoyu Cao; Yunsheng Wu; Ziwei Liu; Xing Sun; Caifeng Shan; Ran He

arXiv:2604.05015 · cs.CV · April 8, 2026

TL;DR

Video-MME-v2 introduces a rigorous, multi-level benchmark for comprehensive video understanding, emphasizing robustness, reasoning, and data quality to bridge the gap between model scores and real-world capabilities.

Contribution

The paper proposes a hierarchical evaluation framework and a group-based scoring strategy, together with a high-quality, human-annotated dataset for evaluating video understanding models.

Findings

1. Current models lag behind human performance on Video-MME-v2.

2. Errors in visual and temporal reasoning limit high-level understanding.

3. Textual cues such as subtitles can improve, or sometimes hinder, visual reasoning.

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step…
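To make the contrast between per-question accuracy and group-based non-linear scoring concrete, here is a minimal sketch. The abstract is truncated before it gives the actual formula, so everything below is an assumption for illustration and not the paper's method: the power-law form `(correct / total) ** exponent`, the default `exponent` of 2, and the function names `group_scores`, `benchmark_score`, and `per_question_accuracy` are all hypothetical.

```python
from collections import defaultdict


def group_scores(results, exponent=2.0):
    """Score each question group non-linearly.

    results: iterable of (group_id, is_correct) pairs.
    ASSUMPTION: a group's score is (fraction correct) ** exponent. This is an
    illustrative choice, not the formula used by Video-MME-v2.
    """
    totals = defaultdict(lambda: [0, 0])  # group_id -> [num_correct, num_questions]
    for group_id, is_correct in results:
        totals[group_id][0] += int(bool(is_correct))
        totals[group_id][1] += 1
    return {g: (c / n) ** exponent for g, (c, n) in totals.items()}


def benchmark_score(results, exponent=2.0):
    """Average the non-linear group scores over all groups."""
    scores = group_scores(results, exponent)
    return sum(scores.values()) / len(scores) if scores else 0.0


def per_question_accuracy(results):
    """Conventional accuracy: every question counts independently."""
    results = list(results)
    if not results:
        return 0.0
    return sum(int(bool(c)) for _, c in results) / len(results)


# Hypothetical data: group g1 is answered consistently, g2 only half the time.
example = [
    ("g1", True), ("g1", True), ("g1", True), ("g1", True),
    ("g2", True), ("g2", False), ("g2", True), ("g2", False),
]
print(per_question_accuracy(example))  # plain accuracy
print(benchmark_score(example))        # group-based non-linear score
```

Under these assumptions, a group that is only partly correct contributes less than its raw fraction of correct answers, so a model that answers related questions inconsistently scores lower than plain accuracy would suggest.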

Code & Models

Repositories

mme-benchmarks/Video-MME-v2 (GitHub)
