Taming Data and Transformers for Audio Generation

Moayed Haji-Ali; Willi Menapace; Aliaksandr Siarohin; Guha Balakrishnan; Vicente Ordonez

arXiv:2406.19388 · cs.SD · April 17, 2025

TL;DR

This paper introduces AutoReCap-XL (a large-scale ambient audio-text dataset), AutoCap (a high-quality automatic audio captioning model), and GenAu (a scalable transformer architecture for ambient audio generation), which together improve generation quality and scalability.

Contribution

The paper combines data collection, automatic captioning, and scalable modeling to advance ambient audio generation.

Findings

01

AutoReCap-XL dataset with over 47 million clips

02

AutoCap achieves a CIDEr score of 83.2, a 3.2% improvement over previous captioning models

03

GenAu improves FAD by 4.7%, IS by 11.1%, and CLAP score by 13.5%
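
The percentages in the findings above mix metrics where higher is better (CIDEr, IS, CLAP score) with one where lower is better (FAD). The following is a minimal sketch of how such figures can be computed and sanity-checked, assuming they are relative changes against a baseline; the page does not say so explicitly. The function names and the FAD values are hypothetical placeholders, not results from the paper.

```python
def relative_improvement(baseline, new, lower_is_better=False):
    """Return the improvement of `new` over `baseline` as a percentage.

    Assumption: improvement is measured relative to the baseline value.
    For metrics such as FAD, a decrease counts as an improvement.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / abs(baseline)


def implied_baseline(new, improvement_pct):
    """Back out the baseline of a higher-is-better metric from a relative gain."""
    return new / (1.0 + improvement_pct / 100.0)


# Hypothetical FAD values, chosen only to illustrate the direction handling.
print(round(relative_improvement(2.0, 1.906, lower_is_better=True), 1))

# If the reported 3.2% CIDEr gain is relative, the prior score would be roughly:
print(round(implied_baseline(83.2, 3.2), 1))
```

If the 3.2% figure is instead an absolute difference in CIDEr points, the implied prior score would be 80.0, so the two readings are close but not identical.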

Abstract

The scalability of ambient sound generators is hindered by data scarcity, insufficient caption quality, and limited scalability in model architecture. This work addresses these challenges by advancing both data and model scaling. First, we propose an efficient and scalable dataset collection pipeline tailored for ambient audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset with over 47 million clips. To provide high-quality textual annotations, we propose AutoCap, a high-quality automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of $83.2$, a $3.2\%$ improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. We demonstrate its benefits…
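
The abstract mentions a Q-Former module, and one reviewer quotes the paper as compacting the audio representation into 4x fewer tokens with it. The sketch below illustrates only the core idea behind that kind of compaction: a small set of query vectors cross-attends over a longer token sequence and yields one output per query. It is a simplified, single-head version in pure Python with no learned projections, self-attention, or feed-forward layers, and the random queries stand in for learned parameters. All names and sizes are assumptions for illustration, not AutoCap's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compact_tokens(audio_tokens, queries):
    """Cross-attend each query over all audio tokens; one output per query."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        weights = softmax([dot(query, tok) * scale for tok in audio_tokens])
        # Attention-weighted average of the audio tokens
        # (keys and values are the tokens themselves in this sketch).
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, audio_tokens))
            for d in range(dim)
        ])
    return outputs


rng = random.Random(0)
dim, num_audio_tokens = 8, 32  # hypothetical sizes
audio_tokens = [[rng.gauss(0, 1) for _ in range(dim)]
                for _ in range(num_audio_tokens)]
# One query per four input tokens gives the 4x reduction.
queries = [[rng.gauss(0, 1) for _ in range(dim)]
           for _ in range(num_audio_tokens // 4)]

compacted = compact_tokens(audio_tokens, queries)
print(len(audio_tokens), "->", len(compacted))  # 32 -> 8
```

The output length depends only on the number of queries, which is why this design gives a fixed-size summary regardless of how many tokens the audio encoder produces.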

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The illustrations in the paper are clear and it is well written.
- No evident flaws.
- AutoReCap-XL will be a great asset for the audio research community if open-sourced.

Weaknesses

- The authors shared audio samples from Stable-Audio 1.0 [7], but the comparison is missing from the results table.
- Typo in Table 5, last row, "Quality" column: "-".
- The performance improvement relative to the increase in the number of parameters in GenAu is marginal.
- Recently, Large Audio Language Models [1,2,3,4] have been employed for the audio captioning task, but the authors don't compare AutoCap to these baselines, which in my opinion should be an important comparison.
- Inconsistent use of word

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The paper presents a simple pipeline for labeling audio data, producing a larger dataset than ever before.
* The paper shows that the new dataset improves the quality of trained models.
* The authors train a SOTA model using the new dataset.
* The authors say they will release the dataset, which could be a good contribution to the community.

Weaknesses

* In scenarios where metadata lacks detail, audio captioning may struggle to disambiguate sounds accurately. The model also tends to falter in capturing the temporal relationships between sounds and differentiating foreground from background noises.
* Fine-tuned on AudioCaps, which contains a limited vocabulary of 4,892 unique words. The limited vocabulary of the paired texts, even though extensive, hampers the model’s ability to accurately generate audio for long and detailed prompts.
* The p

Reviewer 03 · Rating 3 · Confidence 5

Strengths

- The paper is well written and well presented. The figures are good and the writing is crisp. It is a nice paper to read.
- The method shows good improvements. Open-sourcing the artifacts in the future would help the audio community.
- The intuitions are good. The fact that good captions can improve audio generation is a good finding and well conveyed, although I feel some parts are over-claimed, which I mention next.
- I don't see many technical flaws with the paper.

Weaknesses

I have several issues with the paper. I will first point out the technical weaknesses:
- Fig. 1 says CLAP Encoder has one token. CLAP uses HTSAT as the audio encoder, which also has intermediate representations. This only means that the authors used the CLS token (or some pooled representations), which is not specified. The authors should also clearly mention "CLAP audio encoder".
- The caption says "We then compact this representation into 4x fewer tokens using a Q-Former (Li et al., 2023a) mod

Code & Models

Repositories

snap-research/GenAU
pytorch

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing