TAVGBench: Benchmarking Text to Audible-Video Generation

Yuxin Mao; Xuyang Shen; Jing Zhang; Zhen Qin; Jinxing Zhou; Mochu Xiang; Yiran Zhong; Yuchao Dai

arXiv:2404.14381·cs.CV·April 23, 2024

TL;DR

This paper introduces TAVGBench, a large-scale benchmark for text to audible-video generation, along with a new model TAVDiffusion and an alignment metric AVHScore, to advance research in multimodal video synthesis.

Contribution

The paper provides the first comprehensive benchmark dataset, a novel alignment metric, and a baseline diffusion model for text to audible-video generation.

Findings

01 TAVDiffusion effectively aligns audio and video using cross-attention and contrastive learning.

02 The benchmark contains over 1.7 million clips, enabling extensive evaluation.

03 Experimental results show the model outperforms existing methods on the proposed metrics.

Abstract

The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmony Score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in…
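The abstract above does not spell out how the alignment score is computed, so the following is only a minimal sketch of one plausible shape for an audio-video alignment measure. It assumes (hypothetically) that each video frame and the audio track have already been embedded into a shared feature space by some pretrained encoder, and it averages the cosine similarity between each frame embedding and the audio embedding. The function names `cosine_similarity` and `alignment_score` and the toy vectors are illustrative, not taken from the paper or its repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_score(frame_embeddings, audio_embedding):
    """Mean cosine similarity between each frame embedding and the audio embedding.

    Assumption: embeddings come from a shared audio-visual feature space;
    the encoder that produces them is outside the scope of this sketch.
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    sims = [cosine_similarity(f, audio_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy example: one frame matches the audio direction, one is orthogonal to it.
frames = [[1.0, 0.0], [0.0, 1.0]]
audio = [1.0, 0.0]
score = alignment_score(frames, audio)
print(score)  # 0.5
```

Under this assumed formulation, a higher score means the frames and the audio sit closer together in the shared embedding space; whether that matches the paper's exact definition should be checked against the full text.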

Code & Models

Repositories

opennlplab/tavgbench (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Human Motion and Animation · Multimedia Communication and Technology

Methods: Latent Diffusion Model · Diffusion