Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Yazhou Xing; Yingqing He; Zeyue Tian; Xintao Wang; Qifeng Chen

arXiv:2402.17723 · cs.CV · February 29, 2024 · 1 cites

TL;DR

This paper introduces a novel diffusion-based framework that leverages pre-trained models and a shared latent space to enable open-domain joint video-audio generation, bridging the gap between separate modalities.

Contribution

It proposes a multimodality latent aligner using ImageBind to unify video and audio generation models without training from scratch.

Findings

01 Superior performance on joint video-audio generation tasks

02 Effective visual-steered audio generation results

03 Accurate audio-steered visual generation outcomes

Abstract

Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through…
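
The abstract compares the latent aligner to classifier guidance applied during the denoising process at inference time. The sketch below illustrates that general idea only and is not the authors' implementation: after each step of a frozen denoiser, the latent is nudged along the negative gradient of an alignment loss measured in a shared embedding space. Everything in it is an assumption made for illustration: the fixed linear map `W` stands in for a real shared-space encoder such as ImageBind, `toy_denoise_step` stands in for a pre-trained diffusion model, and the squared-distance loss and `guidance_scale` value are arbitrary choices.

```python
# Toy sketch of inference-time guidance; none of this is the paper's code.

# Hypothetical stand-in for a shared-space encoder such as ImageBind:
# a fixed linear map from a 3-d latent to a 2-d embedding.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5]]


def embed(latent):
    """Project a latent into the toy shared embedding space."""
    return [sum(w * x for w, x in zip(row, latent)) for row in W]


def alignment_loss(latent, target):
    """Half squared distance between embed(latent) and the condition embedding."""
    return 0.5 * sum((e - t) ** 2 for e, t in zip(embed(latent), target))


def alignment_grad(latent, target):
    """Analytic gradient of alignment_loss w.r.t. the latent: W^T (W z - t)."""
    residual = [e - t for e, t in zip(embed(latent), target)]
    return [sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(latent))]


def toy_denoise_step(latent, shrink=0.98):
    """Placeholder for one step of a frozen pre-trained denoiser (assumption)."""
    return [shrink * x for x in latent]


def guided_sampling(latent, target, steps=50, guidance_scale=0.5):
    """Alternate a denoising step with a gradient step on the alignment loss."""
    for _ in range(steps):
        latent = toy_denoise_step(latent)
        grad = alignment_grad(latent, target)
        latent = [x - guidance_scale * g for x, g in zip(latent, grad)]
    return latent


# Condition embedding (e.g. from the other modality) and a starting latent.
target = [0.6, 0.8]
start = [1.0, -2.0, 0.5]

loss_before = alignment_loss(start, target)
final = guided_sampling(start, target)
loss_after = alignment_loss(final, target)
print(loss_before, loss_after)
```

In this toy setting the generator's weights are never updated; only the latent is optimized during sampling, which is the property the abstract highlights when it proposes bridging existing models instead of training from scratch. With a real encoder the gradient would normally come from automatic differentiation, not a closed-form expression.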

Taxonomy

Topics: Music and Audio Processing

Methods: Diffusion