AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

TL;DR
AudioLDM 2 introduces a unified framework for diverse audio generation tasks using a shared representation called 'language of audio' (LOA), leveraging self-supervised pretraining and diffusion models to achieve state-of-the-art results.
Contribution
The paper proposes a novel unified framework that uses LOA and self-supervised pretraining for multiple audio generation tasks, enabling in-context learning and improved performance.
Findings
Reports state-of-the-art or competitive results on text-to-audio, text-to-music, and text-to-speech benchmarks.
Introduces a universal audio representation called LOA based on AudioMAE.
Demonstrates effective cross-modal translation and generation capabilities.
Abstract
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally…
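The data flow described in the abstract can be sketched as a toy, under loud assumptions: every function below is a hypothetical stand-in written for illustration (chunk means instead of AudioMAE, a hash instead of GPT-2, a simple pull-toward-target loop instead of a latent diffusion model). It shows only how the pieces connect, not how the authors implement them.

```python
"""Toy data-flow sketch of the two-stage design in the abstract.

All names are hypothetical stand-ins, not the AudioLDM 2 code:
  audio_to_loa      ~ AudioMAE encoder (audio -> LOA)
  condition_to_loa  ~ GPT-2 translator (other modality -> LOA)
  loa_to_audio      ~ latent diffusion model conditioned on LOA
"""
import hashlib
import random
from typing import List

LOA_DIM = 8  # arbitrary toy size for the LOA vector


def audio_to_loa(waveform: List[float]) -> List[float]:
    """Stand-in for AudioMAE: summarise the waveform as LOA_DIM chunk means."""
    chunk = max(1, len(waveform) // LOA_DIM)
    loa = []
    for i in range(LOA_DIM):
        part = waveform[i * chunk:(i + 1) * chunk] or [0.0]
        loa.append(sum(part) / len(part))
    return loa


def condition_to_loa(text: str) -> List[float]:
    """Stand-in for GPT-2: map text to a deterministic pseudo-LOA in [-1, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 * 2.0 - 1.0 for b in digest[:LOA_DIM]]


def loa_to_audio(loa: List[float], n_samples: int = 64,
                 steps: int = 10, seed: int = 0) -> List[float]:
    """Stand-in for the LOA-conditioned generator: start from noise and
    repeatedly move toward a target derived from the LOA."""
    rng = random.Random(seed)
    target = [loa[(i * len(loa)) // n_samples] for i in range(n_samples)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def generate(text: str) -> List[float]:
    """Inference path: condition -> LOA -> audio."""
    return loa_to_audio(condition_to_loa(text))


# Self-supervised pairing: the generator only needs (LOA, audio) pairs,
# and the LOA is computed from the audio itself, so no labels are required.
clip = [0.5] * 64
training_pair = (audio_to_loa(clip), clip)
```

The point of the split is that the generator is trained on audio alone, while only the translator into LOA has to see paired conditioning data.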
Code & Models
- 🤗 haoheliu/audioldm2-full · model · ♡ 6
- 🤗 cvssp/audioldm2 · model · 13k dl · ♡ 64
- 🤗 cvssp/audioldm2-large · model · 211k dl · ♡ 18
- 🤗 cvssp/audioldm2-music · model · 1.7k dl · ♡ 28
- 🤗 anhnct/audioldm2_gigaspeech · model · 34 dl · ♡ 11
- 🤗 vtrungnhan9/audioldm2-music-zac2023 · model · 2 dl · ♡ 1
- 🤗 anhnct/audioldm2_ljspeech · model · 1 dl · ♡ 1
- 🤗 jdp8/audioldm2 · model
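A minimal sketch of running one of the cvssp checkpoints listed above, assuming the `diffusers` library's `AudioLDM2Pipeline`, plus `torch` and `scipy`, are installed. The helper name `generate_wav`, the short variant keys, and the default step count and clip length are illustrative choices, and the 16 kHz sample rate is an assumption that should be checked against the model card of the checkpoint used.

```python
# Hub IDs copied from the list above; the short keys are illustrative.
CHECKPOINTS = {
    "base": "cvssp/audioldm2",
    "large": "cvssp/audioldm2-large",
    "music": "cvssp/audioldm2-music",
}
SAMPLE_RATE = 16000  # assumed output rate; verify on the model card


def generate_wav(prompt, out_path, variant="base",
                 steps=200, seconds=10.0, seed=0):
    """Generate one clip from a text prompt and write it as a WAV file."""
    repo_id = CHECKPOINTS[variant]  # raises KeyError for unknown variants

    # Heavy imports stay local so this module loads without them.
    import torch
    import scipy.io.wavfile
    from diffusers import AudioLDM2Pipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # Fixed seed so repeated calls give the same clip.
    generator = torch.Generator(device).manual_seed(seed)
    audio = pipe(
        prompt,
        num_inference_steps=steps,
        audio_length_in_s=seconds,
        generator=generator,
    ).audios[0]

    scipy.io.wavfile.write(out_path, rate=SAMPLE_RATE, data=audio)
    return out_path
```

A call such as `generate_wav("rain on a tin roof", "rain.wav", variant="base")` downloads the checkpoint on first use, so expect a large download and slow generation without a GPU.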
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Discriminative Fine-Tuning · Adam · Residual Connection · Dense Connections
