AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

Haohe Liu; Yi Yuan; Xubo Liu; Xinhao Mei; Qiuqiang Kong; Qiao Tian; Yuping Wang; Wenwu Wang; Yuxuan Wang; Mark D. Plumbley

arXiv:2308.05734·cs.SD·May 14, 2024·5 cites

TL;DR

AudioLDM 2 introduces a unified framework for diverse audio generation tasks using a shared representation called 'language of audio' (LOA), leveraging self-supervised pretraining and diffusion models to achieve state-of-the-art results.

Contribution

The paper proposes a novel unified framework that uses LOA and self-supervised pretraining for multiple audio generation tasks, enabling in-context learning and improved performance.

Findings

01. Reports state-of-the-art or competitive results on text-to-audio, text-to-music, and text-to-speech benchmarks.

02. Introduces a general audio representation, the "language of audio" (LOA), derived from the self-supervised AudioMAE model.

03. Demonstrates that other modalities can be translated into LOA and used to condition audio generation.

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally…
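The data flow described in the abstract can be sketched at the level of array shapes. This is only an illustrative toy, not the authors' implementation: the three functions below are hypothetical stand-ins (random linear maps in NumPy) for AudioMAE, the GPT-2 translator, and the latent diffusion model, and every dimension and name is an assumption chosen for the example. The point it tries to show is that LOA obtained from audio (self-supervised training) and LOA predicted from another modality (generation) have the same form, so the same generator can consume either.

```python
import numpy as np

# All sizes below are hypothetical, chosen only to keep the toy small.
LOA_LEN = 4       # number of LOA vectors per clip
LOA_DIM = 8       # size of each LOA vector
LATENT_DIM = 16   # size of the generator's latent
COND_DIM = 6      # size of the conditioning embedding (e.g. from text)


def audio_to_loa(audio, w_enc):
    """Stand-in for AudioMAE: split audio into frames and project each to LOA."""
    frames = audio.reshape(LOA_LEN, -1)
    return np.tanh(frames @ w_enc)  # shape (LOA_LEN, LOA_DIM)


def condition_to_loa(cond, w_tr):
    """Stand-in for the GPT-2 translator: emit LOA vectors one at a time,
    each depending on the conditioning and the previously emitted vector."""
    prev = np.zeros(LOA_DIM)
    out = []
    for _ in range(LOA_LEN):
        prev = np.tanh(np.concatenate([cond, prev]) @ w_tr)
        out.append(prev)
    return np.stack(out)  # shape (LOA_LEN, LOA_DIM)


def generate_latent(loa, w_z, w_c, rng, steps=10):
    """Stand-in for LOA-conditioned latent diffusion: start from noise and
    repeatedly subtract a noise estimate that depends on the LOA."""
    z = rng.standard_normal(LATENT_DIM)
    for _ in range(steps):
        predicted_noise = np.tanh(z @ w_z + loa.mean(axis=0) @ w_c)
        z = z - 0.1 * predicted_noise
    return z


rng = np.random.default_rng(0)
audio = rng.standard_normal(64)          # toy waveform
cond = rng.standard_normal(COND_DIM)     # toy conditioning embedding

# Random weights in place of trained parameters.
w_enc = rng.standard_normal((64 // LOA_LEN, LOA_DIM))
w_tr = rng.standard_normal((COND_DIM + LOA_DIM, LOA_DIM))
w_z = rng.standard_normal((LATENT_DIM, LATENT_DIM))
w_c = rng.standard_normal((LOA_DIM, LATENT_DIM))

loa_train = audio_to_loa(audio, w_enc)       # training path: LOA from the audio itself
loa_infer = condition_to_loa(cond, w_tr)     # generation path: LOA from another modality

latent_train = generate_latent(loa_train, w_z, w_c, rng)
latent_infer = generate_latent(loa_infer, w_z, w_c, rng)
print(loa_train.shape, loa_infer.shape, latent_train.shape, latent_infer.shape)
```

Because both paths produce arrays of the same shape, the generator never needs to know where the LOA came from; that interchangeability is what the sketch is meant to convey, and nothing about the real models' sizes or training should be read into it.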

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Discriminative Fine-Tuning · Adam · Residual Connection · Dense Connections