Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

Mu Yang; Bowen Shi; Matthew Le; Wei-Ning Hsu; Andros Tjandra

arXiv:2411.05141 · eess.AS · June 9, 2025

Open Access

TL;DR

This paper introduces Audiobox TTA-RAG, a retrieval-augmented approach to text-to-audio generation that conditions on retrieved audio samples in addition to the text prompt. It significantly improves zero-shot and few-shot performance, and the external retrieval database does not need labeled audio.

Contribution

The paper presents a novel retrieval-augmented TTA method based on Audiobox. It extends the conditioning input from text alone to text plus retrieved audio samples, improving zero-shot and few-shot generation.

Findings

01 Significant improvements in zero-shot and few-shot TTA performance, with large margins on multiple evaluation metrics

02 Effective use of the retrieved audio samples by the model

03 Semantic alignment is maintained in in-domain generation

Abstract

This work focuses on improving Text-To-Audio (TTA) generation in zero-shot and few-shot settings (i.e., generating unseen or uncommon audio events). Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Models, we propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text only, we extend the TTA process by augmenting the conditioning input with both text and retrieved audio samples. Our retrieval method does not require the external database to have labeled audio, offering more practical use cases. We show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero-shot and few-shot TTA performance, with large margins on multiple evaluation metrics, while maintaining…
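The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how retrieval is performed, so everything below is an assumption made for illustration: that the text prompt and the unlabeled audio clips are embedded into a shared vector space by some joint text-audio encoder, that similarity is cosine similarity, and that the top-k clips are passed to the generator alongside the text. The function names `cosine_similarity`, `retrieve_top_k`, and `build_conditioning` are hypothetical and are not taken from the paper or from Audiobox.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    text_embedding: Sequence[float],
    audio_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k audio clips most similar to the text.

    No labels or captions are needed for the audio database: only the
    embeddings of the clips themselves are compared against the query.
    """
    scored = [
        (index, cosine_similarity(text_embedding, embedding))
        for index, embedding in enumerate(audio_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_conditioning(
    text_embedding: Sequence[float],
    audio_embeddings: Sequence[Sequence[float]],
    k: int,
) -> Dict[str, list]:
    """Bundle the text embedding with the retrieved audio embeddings.

    Hypothetical stand-in for "augmenting the conditioning input with both
    text and retrieved audio samples"; a real system would feed audio
    features to the generator rather than these toy vectors.
    """
    hits = retrieve_top_k(text_embedding, audio_embeddings, k)
    return {
        "text": list(text_embedding),
        "retrieved_audio": [list(audio_embeddings[index]) for index, _ in hits],
    }


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (assumed data).
    query = [1.0, 0.0]
    database = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5], [-1.0, 0.0]]
    print(retrieve_top_k(query, database, k=2))
```

Because the database entries are compared only through their embeddings, this kind of lookup works on unlabeled audio, which matches the practical advantage the abstract claims.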

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing