AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Haohe Liu; Zehua Chen; Yi Yuan; Xinhao Mei; Xubo Liu; Danilo Mandic; Wenwu Wang; Mark D. Plumbley

arXiv:2301.12503 · cs.SD · September 12, 2023 · 81 cites

Open Access · 4 Repos · 6 Models

TL;DR

AudioLDM introduces a novel latent diffusion model for text-to-audio generation, achieving high quality and efficiency, and enabling zero-shot audio manipulation based on text prompts.

Contribution

The paper presents AudioLDM, a new text-to-audio (TTA) system that learns audio representations in a latent space from contrastive language-audio pretraining (CLAP) embeddings, improving generation quality and computational efficiency and supporting zero-shot text-guided audio manipulation.

Findings

01. Achieves state-of-the-art TTA performance on AudioCaps, measured by both objective and subjective metrics.

02. Enables various text-guided audio manipulations in a zero-shot setting.

03. Trains on a single GPU.

Abstract

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have been limited in generation quality and incur high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train latent diffusion models (LDMs) with audio embeddings while providing text embeddings as the condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics…
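The conditioning swap described in the abstract (audio embeddings as the condition during training, text embeddings as the condition during sampling) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `embed_audio` and `embed_text` are hypothetical stand-ins that mimic an aligned embedding space, not the CLAP encoders; the latent is a plain list of floats; and no denoiser is actually trained.

```python
import math
import random

DIM = 8  # toy embedding size (assumption, not taken from the paper)


def _unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    a, b = _unit(a), _unit(b)
    return sum(x * y for x, y in zip(a, b))


def embed_audio(concept):
    """Hypothetical audio encoder: maps a sound concept to a unit vector."""
    rng = random.Random(concept)
    return _unit([rng.gauss(0.0, 1.0) for _ in range(DIM)])


def embed_text(caption, jitter=0.05):
    """Hypothetical text encoder: lands near the matching audio embedding,
    mimicking a shared space learned by contrastive pretraining."""
    base = embed_audio(caption)
    rng = random.Random("text:" + caption)
    return _unit([x + jitter * rng.gauss(0.0, 1.0) for x in base])


def add_noise(z0, alpha_bar, rng):
    """Standard diffusion forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
           for z, e in zip(z0, eps)]
    return z_t, eps


def training_example(audio_concept, z0, alpha_bar, rng):
    """Training: the condition is the AUDIO embedding, so no caption is used."""
    z_t, eps = add_noise(z0, alpha_bar, rng)
    return {"z_t": z_t, "target_noise": eps,
            "condition": embed_audio(audio_concept)}


def sampling_condition(caption):
    """Sampling: the condition is swapped for the TEXT embedding."""
    return embed_text(caption)


example = training_example("dog barking", [0.0] * 4, 0.5, random.Random(0))
similarity = cosine(sampling_condition("dog barking"), example["condition"])
```

In this sketch the swap works only because the two stand-in encoders are built to agree on matching inputs; in the paper that role is played by the pretrained CLAP models.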

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis