MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

TL;DR
MMBench-Video is a comprehensive benchmark designed to evaluate large vision-language models' ability to understand long-form, multi-shot videos with complex temporal reasoning, using human-annotated questions and GPT-4 assessment.
Contribution
This paper introduces MMBench-Video, a novel benchmark that assesses LVLMs' proficiency in holistic video understanding, emphasizing long videos, free-form questions, and temporal reasoning.
Findings
Used as an automated judge of free-form answers, GPT-4 is more accurate and robust than previous evaluation methods.
Proprietary and open-source LVLMs show varying performance on the benchmark.
MMBench-Video provides a new standard for evaluating video understanding models.
Abstract
The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier…
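The evaluation flow described in the abstract (free-form answers, questions labelled by ability, an LLM acting as judge) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `QAItem` fields, the 0 to 3 score range, the ability names, and the `exact_match_judge` stand-in are all assumptions made for the example. In practice the judge would be a call to GPT-4 with a grading prompt.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, NamedTuple


class QAItem(NamedTuple):
    ability: str     # taxonomy label (hypothetical names in this sketch)
    question: str    # free-form question about the video
    reference: str   # human-annotated reference answer
    prediction: str  # model's free-form answer


# A judge maps (question, reference, prediction) to an integer score.
Judge = Callable[[str, str, str], int]


def score_by_ability(
    items: Iterable[QAItem], judge: Judge, max_score: int = 3
) -> Dict[str, float]:
    """Average judge scores per ability, plus an 'overall' mean.

    The 0..max_score range is an assumption of this sketch.
    """
    per_ability = defaultdict(list)
    for item in items:
        score = judge(item.question, item.reference, item.prediction)
        if not 0 <= score <= max_score:
            raise ValueError(f"judge returned out-of-range score: {score}")
        per_ability[item.ability].append(score)

    report = {ability: mean(scores) for ability, scores in per_ability.items()}
    report["overall"] = mean(
        s for scores in per_ability.values() for s in scores
    )
    return report


def exact_match_judge(question: str, reference: str, prediction: str) -> int:
    """Placeholder for an LLM judge: full marks on a case-insensitive match."""
    return 3 if prediction.strip().lower() == reference.strip().lower() else 0
```

Swapping `exact_match_judge` for a function that queries an LLM leaves the aggregation unchanged, which is the main reason for passing the judge in as a callable.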
Taxonomy
Topics: Video Analysis and Summarization
Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings
