TAVGBench: Benchmarking Text to Audible-Video Generation
Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, Yuchao Dai

TL;DR
This paper introduces TAVGBench, a large-scale benchmark for text to audible-video generation, along with a new model TAVDiffusion and an alignment metric AVHScore, to advance research in multimodal video synthesis.
Contribution
The paper provides the first comprehensive benchmark dataset, a novel alignment metric, and a baseline diffusion model for text to audible-video generation.
Findings
TAVDiffusion effectively aligns audio and video using cross-attention and contrastive learning.
The benchmark contains over 1.7 million clips, enabling extensive evaluation.
Experiments show that TAVDiffusion outperforms existing methods on the proposed metrics.
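The findings above mention contrastive learning for audio-video alignment but do not state the loss. As a rough illustration only, the sketch below uses an InfoNCE-style objective, a common choice that is assumed here and not confirmed by the source. The function name `info_nce`, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math

def info_nce(video_emb, audio_emb, temperature=0.1):
    """Average video-to-audio InfoNCE loss.

    Assumption: (video_emb[i], audio_emb[i]) is a matching pair and every
    other audio embedding in the batch serves as a negative.
    """
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    v = [unit(x) for x in video_emb]
    a = [unit(x) for x in audio_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Cosine similarity of video i against every audio clip, scaled by temperature.
        logits = [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching pair.
        total += log_denom - logits[i]
    return total / n

# Toy check: correctly paired embeddings should give a lower loss than swapped ones.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
```

This version scores only the video-to-audio direction. A symmetric variant would average it with the audio-to-video loss.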
Abstract
The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in…
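The abstract describes AVHScore as a quantitative measure of audio-video alignment but does not give its formula here. One plausible way to build such a measure, shown purely as a sketch, is to average the cosine similarity between per-frame embeddings and an audio embedding in a shared space. The function names, the shared-embedding assumption, and the toy vectors are all hypothetical and should not be read as the paper's definition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_score(frame_embeddings, audio_embedding):
    """Mean cosine similarity between each frame embedding and the audio embedding.

    Assumption: both modalities were already embedded into the same space
    by some pretrained multimodal encoder (not modeled here).
    """
    sims = [cosine(frame, audio_embedding) for frame in frame_embeddings]
    return sum(sims) / len(sims)

# Toy example: one frame matches the audio direction, one is orthogonal to it.
score = alignment_score([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Under this sketch, higher values mean the frames and the audio point in more similar directions in the embedding space. The result depends entirely on the encoder that produced the embeddings.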
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Multimedia Communication and Technology
Methods: Latent Diffusion Model · Diffusion
