Retrieval-Augmented Text-to-Audio Generation

Yi Yuan; Haohe Liu; Xubo Liu; Qiushi Huang; Mark D. Plumbley; Wenwu Wang

arXiv:2309.08051 · cs.SD · January 8, 2024

Open Access

TL;DR

This paper introduces Re-AudioLDM, a retrieval-augmented method for text-to-audio generation that improves performance on rare and unseen audio classes by leveraging retrieved data to guide generation.

Contribution

The paper proposes a retrieval-augmented approach for text-to-audio (TTA) models. It enhances AudioLDM with retrieved text-audio pairs to counter class imbalance in the training data and improve generation quality.

Findings

01. Achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.37 on AudioCaps

02. Generates realistic audio for complex scenes and rare classes

03. Outperforms existing methods significantly

Abstract

Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting…
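The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that text prompts and database captions have already been embedded into a shared space (as a CLAP-style text encoder would do) and ranks database entries by cosine similarity. The random vectors, the function names `l2_normalize` and `retrieve_top_k`, the feature sizes, and the final concatenation into a conditioning array are all hypothetical stand-ins for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def retrieve_top_k(query_emb, db_text_embs, k=3):
    """Return indices and scores of the k database entries closest to the query."""
    q = l2_normalize(query_emb[None, :])[0]
    db = l2_normalize(db_text_embs)
    scores = db @ q                   # cosine similarity for every database entry
    top = np.argsort(-scores)[:k]     # indices in order of descending similarity
    return top, scores[top]


# Toy stand-ins (assumption): a real system would use embeddings from a
# pretrained language-audio model instead of random vectors.
rng = np.random.default_rng(0)
db_text_embs = rng.normal(size=(100, 16))    # caption embeddings in the database
db_audio_feats = rng.normal(size=(100, 8))   # audio features paired with each caption
query_emb = db_text_embs[42] + 0.01 * rng.normal(size=16)  # prompt close to entry 42

idx, scores = retrieve_top_k(query_emb, db_text_embs, k=3)

# Gather the retrieved text-audio pairs. Concatenating them is one simple
# (assumed) way to package them as extra conditioning for a generator.
condition = np.concatenate([db_text_embs[idx], db_audio_feats[idx]], axis=-1)
print(idx, condition.shape)
```

Ranking by cosine similarity keeps the retrieval independent of embedding magnitude. How the retrieved features are injected into the generator is a separate design choice that this sketch does not cover.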

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis