Taming Data and Transformers for Audio Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Vicente Ordonez

TL;DR
This paper introduces AutoReCap-XL, a large-scale ambient audio-text dataset; AutoCap, a high-quality audio captioning model; and GenAu, a scalable transformer architecture for ambient audio generation. Together they significantly improve generation quality and scalability.
Contribution
It presents a comprehensive approach combining data collection, captioning, and scalable modeling to advance ambient audio generation.
Findings
AutoReCap-XL dataset with over 47 million clips
AutoCap achieves a CIDEr score of 83.2, 3.2% better than previous captioning models
GenAu improves FAD by 4.7%, IS by 11.1%, and CLAP score by 13.5%
Abstract
The scalability of ambient sound generators is hindered by data scarcity, insufficient caption quality, and limited scalability in model architecture. This work addresses these challenges by advancing both data and model scaling. First, we propose an efficient and scalable dataset collection pipeline tailored for ambient audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset with over 47 million clips. To provide high-quality textual annotations, we propose AutoCap, a high-quality automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. We demonstrate its benefits…
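As a rough illustration of the token-compaction idea behind a Q-Former-style module, the sketch below lets a small bank of query vectors cross-attend over a longer sequence of audio tokens, producing one output vector per query. It is not the authors' implementation: the sizes (`dim`, `num_tokens`), the 4x reduction ratio, and the random `query_bank` standing in for learned queries are all assumptions made for the example, and a real Q-Former also includes self-attention, feed-forward layers, and learned projections that are omitted here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compact_tokens(tokens, queries):
    """Cross-attend each query over all tokens; return one vector per query.

    tokens:  list of N vectors of length d (e.g. audio-encoder outputs).
    queries: list of M vectors of length d (learned in a real model).
    Each output is a convex combination of the N input tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    compacted = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        compacted.append(
            [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)]
        )
    return compacted


rng = random.Random(0)
dim, num_tokens = 8, 32          # hypothetical sizes, chosen for the example
num_queries = num_tokens // 4    # assumed 4x reduction in token count

audio_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
# Random stand-in for the learned query embeddings of a trained module.
query_bank = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

compacted = compact_tokens(audio_tokens, query_bank)
print(len(audio_tokens), "->", len(compacted))  # prints: 32 -> 8
```

The output length depends only on the number of queries, so the downstream language model sees a fixed, shorter sequence regardless of how many tokens the audio encoder emits.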
Peer Reviews
Decision · Submitted to ICLR 2025
- The illustrations in the paper are clear and it is well written.
- No evident flaws.
- AutoReCap-XL will be a great asset for the audio research community if open-sourced.
- The authors shared audio samples from Stable-Audio 1.0 [7], but the comparison is missing from the results table.
- Typo in Table 5, last row, "Quality" column: "-".
- The performance improvement relative to the increase in the number of parameters in GenAu is marginal.
- Recently, Large Audio Language models [1,2,3,4] have been employed for the audio captioning task, but the authors don't compare AutoCap to these baselines, which in my opinion should be an important comparison.
- Inconsistent use of word
* The paper presents a simple pipeline for labeling audio data in order to build a larger dataset than ever before.
* The paper shows that the new dataset improves the quality of trained models.
* The authors train a SOTA model using the new dataset.
* The authors say they will release the dataset, which could be a good contribution to the community.
* In scenarios where metadata lacks detail, audio captioning may struggle to disambiguate sounds accurately. The model also tends to falter in capturing the temporal relationships between sounds and in differentiating foreground from background noises.
* The model is fine-tuned on AudioCaps, which contains a limited vocabulary of 4,892 unique words. The limited vocabulary of the paired texts, even though extensive, hampers the model’s ability to accurately generate audio for long and detailed prompts.
* The p
- The paper is well written and well presented. The figures are good and the writing is crisp. It is a nice paper to read.
- The method shows good improvements. Open-sourcing the artifacts in the future would help the audio community.
- The intuitions are good. The fact that good captions can improve audio generation is a good finding and is well conveyed, although I feel some parts are over-claimed, which I mention next.
- I don't see many technical flaws with the paper.
I have several issues with the paper. I will first point out the technical weaknesses:
- Fig. 1 says CLAP Encoder has one token. CLAP uses HTSAT as the audio encoder which also has intermediate representations. This only means that the authors used the CLS token (or some pooled representations) which is not specified. The authors should also clearly mention "CLAP audio encoder".
- The caption says "We then compact this representation into 4x fewer tokens using a Q-Former (Li et al., 2023a) mod
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing
