MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Yujie Wei; Yujin Han; Zhekai Chen; Yongming Li; Kaixun Jiang; Zhihang Liu; Quanhao Li; Zhiwu Qing; Xiang Wang; Zhen Xing; Ruihang Chu; Lingyi Hong; Yefei He; Junjie Zhou; Junqiu Yu; Yang Shi; Difan Zou; Kai Zhu; Shiwei Zhang; Yingya Zhang; Yu Liu; Xihui Liu; Hongming Shan

arXiv:2605.20183 · cs.CV · May 20, 2026

TL;DR

MSAVBench is a comprehensive benchmark and evaluation framework for systematically assessing multi-shot audio-video (MSAV) generation models across diverse scenarios. It targets the narrow scope, limited data diversity, and rigid evaluation pipelines of existing benchmarks.

Contribution

The paper introduces MSAVBench, the first adaptive, multi-dimensional benchmark for MSAV models, paired with an evaluation framework that aligns closely with human judgments.

Findings

01. The evaluation framework correlates highly (91.5%) with human judgments.

02. Current models struggle with director control and synchronization.

03. Modular pipelines show promising improvements.
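
For readers unfamiliar with how agreement between an automatic metric and human judgments is usually quantified, the following is a minimal sketch of a rank correlation computation. It is not taken from the paper: the summary above does not say which correlation coefficient underlies the 91.5% figure, so the choice of Spearman correlation is an assumption, and the function names and the score lists are hypothetical illustration data.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values (a tie group).
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic score and one human rating per video.
metric_scores = [0.82, 0.41, 0.67, 0.90, 0.35]
human_ratings = [4.5, 2.0, 3.0, 5.0, 2.5]

rho = spearman(metric_scores, human_ratings)
print(round(rho, 3))  # 0.9 for this toy data
```

A rank-based coefficient is a common choice here because human ratings are often ordinal, so only the ordering of the videos matters and not the scale of either score.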

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference), covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and…
