Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi

TL;DR
This paper introduces a method for generating diverse, realistic, and well-aligned videos from audio inputs by adapting a text-to-video model with a lightweight audio adaptor. The same adaptor allows video generation conditioned on text, audio, or both, and the paper also proposes a metric, AV-Align, for evaluating audio-video alignment.
Contribution
It presents a lightweight adaptor that maps audio representations into the input space of a text-to-video model, so that generation can be conditioned on audio, text, or both, and introduces the AV-Align metric for assessing audio-video alignment.
Findings
Generated videos show better audio alignment than state-of-the-art methods.
Videos exhibit higher diversity and visual quality.
The AV-Align metric effectively measures audio-video alignment.
Abstract
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method…
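The adaptor idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the mean pooling of audio frames into temporal segments, the single linear projection, the helper names pool_audio_segments, AudioAdaptor, and build_conditioning, and all dimensions. The paper's adaptor is learned, whereas the weights below are random placeholders.

```python
import numpy as np


def pool_audio_segments(audio_feats, num_segments):
    """Average frame-level audio features into temporal segments.

    audio_feats: array of shape (num_frames, audio_dim), assumed to come
    from a pre-trained audio encoder (hypothetical input).
    """
    if num_segments < 1 or num_segments > audio_feats.shape[0]:
        raise ValueError("num_segments must be between 1 and num_frames")
    chunks = np.array_split(audio_feats, num_segments, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


class AudioAdaptor:
    """Hypothetical adaptor: one linear map from audio_dim to token_dim.

    In practice the weights would be trained; here they are random
    placeholders so the sketch stays self-contained.
    """

    def __init__(self, audio_dim, token_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.02, size=(audio_dim, token_dim))
        self.bias = np.zeros(token_dim)

    def __call__(self, segments):
        # (num_segments, audio_dim) -> (num_segments, token_dim)
        return segments @ self.weight + self.bias


def build_conditioning(audio_tokens, text_tokens=None):
    """Return audio-only conditioning, or text tokens followed by audio tokens."""
    if text_tokens is None:
        return audio_tokens
    return np.concatenate([text_tokens, audio_tokens], axis=0)


# Usage with made-up sizes: 50 audio frames, 8-dim audio features,
# 16-dim token embeddings, 3 text tokens.
audio_feats = np.ones((50, 8))
segments = pool_audio_segments(audio_feats, num_segments=5)
audio_tokens = AudioAdaptor(audio_dim=8, token_dim=16)(segments)
conditioning = build_conditioning(audio_tokens, text_tokens=np.zeros((3, 16)))
print(conditioning.shape)  # (8, 16)
```

Keeping one token per audio segment is one simple way to preserve the temporal ordering that the abstract asks for, and concatenating with text tokens shows how joint text and audio conditioning could share one input sequence.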
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Video Analysis and Summarization
