VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

Qian Zhang; Yuqin Cao; Yixuan Gao; Xiongkuo Min

arXiv:2604.10542·cs.SD·April 14, 2026

TL;DR

VidAudio-Bench is a multi-task benchmark for evaluating video-to-audio (V2A) and video-text-to-audio (VT2A) generation across four audio categories: sound effects, music, speech, and singing. It introduces 13 task-specific, reference-free metrics and exposes the limitations of current models.

Contribution

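The paper introduces a multi-task benchmark of 1,634 video-text pairs, evaluates 11 state-of-the-art generation models with 13 task-specific, reference-free metrics, and includes a human-alignment validation. Together these address the lack of fine-grained, category-specific assessment of V2A and VT2A systems.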

Findings

1. Current models perform poorly on speech and singing generation.

2. Visual conditioning improves video-audio alignment but may reduce audio category accuracy.

3. The benchmark exposes trade-offs in multimodal audio generation, such as the tension between video-audio alignment and audio category accuracy.

Abstract

Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories (sound effects, music, speech, and singing) under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all…

Code & Models

Datasets

QianZhang17/VidAudio-Bench
dataset · 1.7k downloads
