AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

TL;DR
AudioLDM introduces a novel latent diffusion model for text-to-audio generation, achieving high quality and efficiency, and enabling zero-shot audio manipulation based on text prompts.
Contribution
The paper presents AudioLDM, a new text-to-audio (TTA) system that learns audio representations in a latent space and uses contrastive language-audio pretraining to improve generation quality and computational efficiency, with zero-shot text-guided manipulation capabilities.
Findings
Achieves state-of-the-art TTA performance on AudioCaps.
Enables various text-guided audio manipulations in a zero-shot setting.
Operates efficiently on a single GPU.
Abstract
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics…
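The conditioning swap described in the abstract (audio embedding during training, text embedding during sampling) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the linear beta schedule, and the noise-prediction target are assumptions borrowed from standard DDPM-style training, and the embeddings are plain Python lists standing in for CLAP outputs.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, alpha_bar, rng):
    """Forward diffusion on a latent: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
          for z, e in zip(z0, eps)]
    return zt, eps


def pick_condition(mode, audio_embedding=None, text_embedding=None):
    """Training conditions on the audio embedding; sampling swaps in the text embedding."""
    if mode == "train":
        return audio_embedding
    if mode == "sample":
        return text_embedding
    raise ValueError(f"unknown mode: {mode!r}")


def training_example(z0, audio_embedding, alpha_bars, rng):
    """Build one (hypothetical) training tuple; no text is needed at this stage."""
    t = rng.randrange(len(alpha_bars))
    zt, eps = add_noise(z0, alpha_bars[t], rng)
    cond = pick_condition("train", audio_embedding=audio_embedding)
    return {"z_t": zt, "t": t, "condition": cond, "target_noise": eps}


rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)
example = training_example([0.5, -1.0, 2.0], [0.1, 0.2], alpha_bars, rng)
```

The swap only makes sense because a contrastively pretrained model places audio and text embeddings in a shared space, so a denoiser trained on one kind of embedding can be driven by the other at sampling time.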
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
