MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Xinyu Fang; Kangrui Mao; Haodong Duan; Xiangyu Zhao; Yining Li; Dahua Lin; Kai Chen

arXiv:2406.14515·cs.CV·October 31, 2024·3 cites

TL;DR

MMBench-Video is a comprehensive benchmark designed to evaluate large vision-language models' ability to understand long-form, multi-shot videos with complex temporal reasoning, using human-annotated questions and GPT-4 assessment.

Contribution

This paper introduces MMBench-Video, a novel benchmark that assesses LVLMs' proficiency in holistic video understanding, emphasizing long videos, free-form questions, and temporal reasoning.

Findings

01

Using GPT-4 as the automated judge yields more accurate and robust assessment than earlier evaluation methods.

02

Proprietary and open-source LVLMs show varying performance on the benchmark.

03

MMBench-Video provides a new standard for evaluating video understanding models.

Abstract

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier…

Code & Models

Repositories

open-compass/vlmevalkit
PyTorch · Official

Datasets

opencompass/MMBench-Video
dataset · 524 dl

Videos

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding · slideslive

Taxonomy

Topics: Video Analysis and Summarization

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings