MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang

TL;DR
MSVBench introduces a comprehensive multi-shot video generation benchmark with hierarchical scripts and a hybrid evaluation framework, revealing current models' limitations and enabling scalable, human-aligned improvements.
Contribution
The paper presents the first multi-shot video generation benchmark with hierarchical scripts and a hybrid evaluation method, moving evaluation beyond single-shot paradigms.
Findings
Current models act as visual interpolators, not true world models.
MSVBench achieves 94.4% correlation with human judgments.
Fine-tuning on MSVBench's signals yields human-aligned performance.
Abstract
The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models, despite strong visual fidelity, primarily behave as visual interpolators rather than true world models. We further validate the…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization
