MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

Haoyuan Shi; Yunxin Li; Nanhao Deng; Zhenran Xu; Xinyu Chen; Longyue Wang; Baotian Hu; Min Zhang

arXiv:2602.23969 · cs.MM · March 2, 2026

Open Access

TL;DR

MSVBench introduces a comprehensive multi-shot video generation benchmark with hierarchical scripts and a hybrid evaluation framework, revealing current models' limitations and enabling scalable, human-aligned improvements.

Contribution

The paper presents the first multi-shot video generation benchmark with hierarchical scripts and a hybrid evaluation method, moving beyond single-shot paradigms.

Findings

01 Current models act as visual interpolators, not true world models.

02 MSVBench achieves 94.4% correlation with human judgments.

03 Fine-tuning on MSVBench's signals yields human-aligned performance.
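
To make the human-correlation finding concrete, here is a minimal sketch of how agreement between benchmark scores and human ratings could be measured. The summary above does not say which correlation coefficient is reported, so the use of Spearman rank correlation is an assumption, and the function names and all numbers are made up for illustration; none of them come from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores (illustrative values only, not from the paper).
benchmark_scores = [0.62, 0.48, 0.91, 0.33, 0.75]
human_ratings = [3.9, 3.1, 4.6, 2.0, 3.5]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based coefficient is a natural choice when the goal is to check that a benchmark orders models the same way human raters do, since it ignores differences in scale between the two scoring systems.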

Abstract

The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models, despite strong visual fidelity, primarily behave as visual interpolators rather than true world models. We further validate the…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization