T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Zhe Cao; Tao Wang; Jiaming Wang; Yanghai Wang; Yuanxing Zhang; Jialu Chen; Miao Deng; Jiahao Wang; Yubin Guo; Chenxi Liao; Yize Zhang; Zhaoxiang Zhang; Jiaheng Liu

arXiv:2512.21094·cs.CV·December 25, 2025

Open Access · 1 dataset

TL;DR

This paper introduces T2AV-Compass, a comprehensive benchmark and evaluation framework for text-to-audio-video generation, addressing fragmented assessment methods and revealing significant gaps in current model performance.

Contribution

It presents a unified benchmark with diverse prompts and a dual-level evaluation framework combining objective metrics and subjective judgment for T2AV systems.

Findings

01. Current models underperform in realism and cross-modal alignment

02. Persistent issues in audio realism and synchronization

03. Benchmark reveals significant room for improvement

Abstract

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive…
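One way to picture the dual-level framework described in the abstract is as a per-clip record that keeps objective and judge scores apart, then averages each dimension over the prompt set. The sketch below is illustrative only and is not the authors' implementation: the class and function names, the score ranges, the plain averaging, and the example numbers are all assumptions. Only the five dimension names come from the abstract.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class SampleEvaluation:
    """Scores for one generated clip. All names here are hypothetical."""
    prompt_id: str
    # Objective, signal-level scores (assumed already normalized to [0, 1]).
    objective: dict[str, float] = field(default_factory=dict)
    # Subjective scores from an MLLM judge (assumed to be on a 1-5 scale).
    judge: dict[str, float] = field(default_factory=dict)


def summarize(samples: list[SampleEvaluation]) -> dict[str, float]:
    """Average each dimension across samples, keeping the two levels separate."""
    report: dict[str, float] = {}
    for level in ("objective", "judge"):
        # Collect every dimension name that appears at this level.
        names = sorted({name for s in samples for name in getattr(s, level)})
        for name in names:
            values = [
                getattr(s, level)[name]
                for s in samples
                if name in getattr(s, level)
            ]
            report[f"{level}/{name}"] = mean(values)
    return report


# Placeholder numbers for illustration; these are not results from the paper.
samples = [
    SampleEvaluation(
        "p001",
        {"video_quality": 0.8, "audio_quality": 0.6, "cross_modal_alignment": 0.5},
        {"instruction_following": 4.0, "realism": 3.0},
    ),
    SampleEvaluation(
        "p002",
        {"video_quality": 0.6, "audio_quality": 0.4, "cross_modal_alignment": 0.5},
        {"instruction_following": 2.0, "realism": 3.0},
    ),
]
report = summarize(samples)
```

Keeping the two levels under separate key prefixes avoids mixing scores on different scales; whether and how the benchmark combines them into a single number is not stated in the text above.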

Code & Models

Datasets

NJU-LINK/T2AV-Compass
dataset · 110 downloads

Taxonomy

Topics: Speech and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis