Improving Text-To-Audio Models with Synthetic Captions

Zhifeng Kong; Sang-gil Lee; Deepanway Ghosal; Navonil Majumder; Ambuj Mehrish; Rafael Valle; Soujanya Poria; Bryan Catanzaro

arXiv:2406.15487 · cs.CL · July 10, 2024 · 1 citation

TL;DR

This paper introduces an audio captioning pipeline using an audio language model to generate synthetic captions, enhancing training data for text-to-audio models and significantly improving audio generation quality.

Contribution

The paper presents a novel audio captioning pipeline built on an audio language model, which enables large-scale generation of synthetic captions for audio datasets.

Findings

01. Synthetic captions improve text-to-audio model performance

02. Achieved state-of-the-art results on AudioCaps and MusicCaps

03. Demonstrated scalability and diversity in caption synthesis

Abstract

It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and to coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state of the art.
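
The abstract describes the pipeline only at a high level: an audio language model proposes captions, and the result should be accurate and diverse. The sketch below is one plausible shape for such a pipeline, not the authors' implementation. It assumes a generate-then-filter design in which several candidate captions per clip are ranked by audio-text embedding similarity; the abstract does not state that filtering step. Every name here (`synthesize_captions`, `min_score`, `keep_per_clip`, and the `toy_*` stand-ins for a captioner and an embedding model) is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

Vector = Sequence[float]


@dataclass
class CaptionedClip:
    clip_id: str
    caption: str
    score: float  # audio-text similarity of the kept caption


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def synthesize_captions(
    clip_ids: Iterable[str],
    generate_candidates: Callable[[str], List[str]],  # assumed: audio LM wrapper
    embed_audio: Callable[[str], Vector],             # assumed: audio encoder
    embed_text: Callable[[str], Vector],              # assumed: text encoder
    min_score: float = 0.5,                           # assumed threshold
    keep_per_clip: int = 1,
) -> List[CaptionedClip]:
    """Keep the best-matching candidate captions for each clip."""
    kept: List[CaptionedClip] = []
    for clip_id in clip_ids:
        audio_vec = embed_audio(clip_id)
        # dict.fromkeys removes duplicate candidates while preserving order.
        candidates = dict.fromkeys(generate_candidates(clip_id))
        scored = [
            CaptionedClip(clip_id, text, cosine(audio_vec, embed_text(text)))
            for text in candidates
        ]
        scored.sort(key=lambda c: c.score, reverse=True)
        kept.extend(c for c in scored[:keep_per_clip] if c.score >= min_score)
    return kept


# Toy stand-ins so the sketch runs without any model; not real encoders.
TOY_AUDIO = {"clip_dog": (1.0, 0.0), "clip_rain": (0.0, 1.0)}
TOY_CANDIDATES = {
    "clip_dog": ["a dog barks twice", "rain falls on a roof"],
    "clip_rain": ["a dog barks in the rain", "steady rain falls"],
}


def toy_generate(clip_id: str) -> List[str]:
    return TOY_CANDIDATES[clip_id]


def toy_embed_audio(clip_id: str) -> Vector:
    return TOY_AUDIO[clip_id]


def toy_embed_text(caption: str) -> Vector:
    # Two-dimensional keyword counts: (dog-related, rain-related).
    words = caption.split()
    dog = sum(w in {"dog", "barks"} for w in words)
    rain = sum(w in {"rain", "falls"} for w in words)
    return (float(dog), float(rain))


if __name__ == "__main__":
    for item in synthesize_captions(
        TOY_AUDIO, toy_generate, toy_embed_audio, toy_embed_text
    ):
        print(item.clip_id, "->", item.caption)
```

In a real setting the three callables would wrap an audio language model and a joint audio-text embedding model. Passing them in as functions keeps the ranking and thresholding logic testable without loading any weights.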

Code & Models

Repositories

declare-lab/tango (PyTorch)

Taxonomy

Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Video Analysis and Summarization