MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng; Haochen Wang; Yuanxing Zhang; Zekun Wang; Zili Wang; Gavin Chang; Jian Yang; Shihao Li; Yanghai Wang; Xintao Wang; Houyi Li; Wei Ji; Pengfei Wan; Steven Huang; Zhaoxiang Zhang; Jiaheng Liu

arXiv:2511.07250 · cs.CV · November 14, 2025

TL;DR

MVU-Eval introduces the first comprehensive benchmark to evaluate multi-video understanding in multimodal large language models, addressing a critical gap for real-world applications like autonomous driving and sports analytics.

Contribution

This work presents MVU-Eval, a benchmark of 1,824 question-answer pairs spanning 4,959 videos, designed to assess multi-video understanding in MLLMs, a capability that existing benchmarks, limited to single-video understanding, do not evaluate.

Findings

01. Current MLLMs show significant performance gaps in multi-video understanding.

02. The benchmark reveals limitations in existing models' ability to handle multi-video tasks.

03. Evaluation highlights the need for improved multi-video reasoning capabilities.

Abstract

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of…

Code & Models

Datasets

MVU-Eval-Team/MVU-Eval-Data
dataset · 225 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Human Pose and Action Recognition