Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Guy Yariv; Itai Gat; Sagie Benaim; Lior Wolf; Idan Schwartz; Yossi Adi

arXiv:2309.16429·cs.LG·September 29, 2023·2 cites

TL;DR

This paper introduces a method for generating diverse, realistic, and well-aligned videos from audio inputs by adapting a text-to-video model with a lightweight audio adaptor, enabling multimodal conditioning and improved alignment evaluation.

Contribution

It presents a novel lightweight adaptor for audio-to-video generation that supports conditioning on audio, on text, or on both, and introduces AV-Align, a new metric for assessing audio-video alignment.

Findings

01 Generated videos are better aligned with the input audio than those of state-of-the-art methods.

02 Generated videos also show higher diversity and visual quality than those of the compared methods.

03 The AV-Align metric effectively measures audio-video alignment.

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method…
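The adaptor idea in the abstract can be illustrated with a toy sketch. It assumes the audio encoder emits a sequence of feature frames and that the text-to-video model expects a short sequence of conditioning tokens of a fixed width. Every name here (`mean_pool_segments`, `project`, `audio_to_tokens`, `make_linear`) is hypothetical, and the mean pooling plus single random linear map are stand-ins chosen for brevity; they are not the architecture or the trained weights described in the paper.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Random d_in x d_out matrix standing in for a learned projection (assumption)."""
    rng = random.Random(seed)
    scale = d_in ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def mean_pool_segments(frames, num_tokens):
    """Average contiguous runs of audio frames so that each output vector
    summarises one temporal segment of the input audio."""
    n = len(frames)
    if n < num_tokens:
        raise ValueError("need at least one frame per token")
    dim = len(frames[0])
    pooled = []
    for k in range(num_tokens):
        start = k * n // num_tokens
        end = (k + 1) * n // num_tokens
        segment = frames[start:end]
        pooled.append([sum(f[j] for f in segment) / len(segment) for j in range(dim)])
    return pooled


def project(vectors, weight):
    """Multiply each vector by the weight matrix (rows index the input dimension)."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def audio_to_tokens(frames, weight, num_tokens):
    """Map audio frames to conditioning tokens: pool per segment, then project."""
    return project(mean_pool_segments(frames, num_tokens), weight)


# Toy dimensions, chosen only for illustration.
rng = random.Random(1)
audio_frames = [[rng.random() for _ in range(4)] for _ in range(12)]  # 12 frames, width 4
weight = make_linear(d_in=4, d_out=6)
tokens = audio_to_tokens(audio_frames, weight, num_tokens=3)
print(len(tokens), len(tokens[0]))
```

Because each token is computed from one contiguous segment, changing the audio in one segment changes only the matching token. That is one simple way to picture the temporal alignment requirement; in the actual method the mapping is learned rather than fixed.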

Code & Models

Repositories

guyyariv/TempoTokens
PyTorch · Official

Datasets

JavisVerse/JavisBench
Dataset · 78 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Video Analysis and Summarization