AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, and Chong Luo

arXiv:2604.08540 · cs.CV · April 10, 2026

TL;DR

AVGen-Bench is a task-driven benchmark for multi-granular evaluation of text-to-audio-video generation. It addresses the limits of existing methods, which assess audio and video in isolation or rely on coarse embedding similarity.

Contribution

The paper contributes a benchmark of high-quality prompts across 11 real-world categories and a multi-granular evaluation framework that combines lightweight specialist models with MLLMs for fine-grained assessment.
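
The idea of routing each evaluation dimension to either a specialist model or an MLLM judge, then reporting scores per granularity level, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `Dimension` class, the two granularity labels, the stub evaluators, and every score value are hypothetical placeholders rather than anything taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical evaluator signature: (prompt, path to generated clip) -> score in [0, 1].
Evaluator = Callable[[str, str], float]


@dataclass
class Dimension:
    name: str         # illustrative label, e.g. "text_rendering"
    granularity: str  # illustrative split: "perceptual" or "semantic"
    evaluator: Evaluator


def evaluate_clip(prompt: str, clip_path: str, dimensions: List[Dimension]) -> Dict[str, float]:
    """Run every dimension's evaluator on one generated clip."""
    return {d.name: d.evaluator(prompt, clip_path) for d in dimensions}


def summarize(scores: Dict[str, float], dimensions: List[Dimension]) -> Dict[str, float]:
    """Average the per-dimension scores within each granularity level."""
    by_level: Dict[str, List[float]] = {}
    for d in dimensions:
        by_level.setdefault(d.granularity, []).append(scores[d.name])
    return {level: mean(values) for level, values in by_level.items()}


# Stub evaluators standing in for real models; the returned values are made up.
def stub_aesthetic_model(prompt: str, clip_path: str) -> float:
    return 0.9  # placeholder for a lightweight specialist model


def stub_mllm_text_judge(prompt: str, clip_path: str) -> float:
    return 0.3  # placeholder for an MLLM-based judgment


def stub_pitch_checker(prompt: str, clip_path: str) -> float:
    return 0.1  # placeholder for a specialist pitch check


dimensions = [
    Dimension("aesthetics", "perceptual", stub_aesthetic_model),
    Dimension("text_rendering", "semantic", stub_mllm_text_judge),
    Dimension("musical_pitch", "semantic", stub_pitch_checker),
]

scores = evaluate_clip("a pianist plays middle C", "clip.mp4", dimensions)
summary = summarize(scores, dimensions)
print(summary)
```

Keeping the per-dimension scores separate from the per-level averages means a high perceptual average cannot hide a low semantic one, which is the kind of gap the findings below describe.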

Findings

01. Significant gap between aesthetic quality and semantic reliability.

02. Persistent failures in text rendering, speech coherence, and physical reasoning.

03. Universal breakdown in musical pitch control.

Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence,…

Code & Models

Datasets

microsoft/AVGen-Bench
dataset · 14k downloads