Retrieval-Augmented Text-to-Audio Generation
Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

TL;DR
This paper introduces Re-AudioLDM, a retrieval-augmented method for text-to-audio generation that improves performance on rare and unseen audio classes by leveraging retrieved data to guide generation.
Contribution
The paper proposes a retrieval-augmented approach for text-to-audio (TTA) models and applies it to AudioLDM, using retrieved data to address class imbalance and improve generation quality.
Findings
Achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.37 on AudioCaps
Generates realistic audio for complex scenes and rare classes
Outperforms existing methods significantly
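For readers unfamiliar with the metric, FAD (Fréchet Audio Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio; lower is better. The symbols below are introduced here for illustration and do not come from this page: \mu_r, \Sigma_r are the mean and covariance of the reference embeddings, and \mu_g, \Sigma_g those of the generated embeddings.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```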
Abstract
Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting…
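The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that text prompts and database captions have already been embedded into a shared space (the paper uses CLAP for this; the random vectors below are toy stand-ins, not real CLAP outputs) and that retrieval is a cosine-similarity top-k search over caption embeddings. The function names, the value of k, and the database layout are hypothetical and chosen only for illustration; how the retrieved features condition the generator is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def retrieve_top_k(query_emb, db_text_emb, k=3):
    """Return indices and cosine scores of the k captions closest to the query."""
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    db = l2_normalize(np.asarray(db_text_emb, dtype=np.float64))
    scores = db @ q                  # cosine similarity to every database caption
    top = np.argsort(-scores)[:k]    # indices of the k highest scores, best first
    return top, scores[top]


# Toy stand-ins for embeddings of a paired text-audio database (assumption:
# 100 pairs, 16 dimensions). Row i of each array belongs to the same pair.
rng = np.random.default_rng(0)
db_text_emb = rng.normal(size=(100, 16))
db_audio_emb = rng.normal(size=(100, 16))

# A query that is a slightly perturbed copy of caption 42.
query_emb = db_text_emb[42] + 0.01 * rng.normal(size=16)

idx, sims = retrieve_top_k(query_emb, db_text_emb, k=3)

# Features of the retrieved pairs, which a TTA model could take as
# additional conditioning inputs.
retrieved_text = db_text_emb[idx]
retrieved_audio = db_audio_emb[idx]
```

In this toy setup the nearest neighbour is the caption the query was derived from, and the paired audio features are looked up by the same indices.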
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
