ETTA: Elucidating the Design Space of Text-to-Audio Models
Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

TL;DR
This paper systematically explores the design space of text-to-audio models, introducing a large synthetic dataset, comparing various model choices, and proposing ETTA, which outperforms baselines and excels in creative audio generation.
Contribution
The paper provides a comprehensive empirical analysis of text-to-audio (TTA) model components (data, architecture, training objective, and sampling strategy) and introduces ETTA, a new model with improved generation quality and creative capabilities.
Findings
ETTA outperforms baseline models on AudioCaps and MusicCaps.
The choice of sampling method significantly affects the trade-off between generation quality and inference speed.
AF-Synthetic, a large dataset of synthetic captions generated by an audio understanding model, enhances training and evaluation.
Abstract
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed.…
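To make the idea of a Pareto curve over generation quality and inference speed concrete, here is a minimal Python sketch. It is not the authors' evaluation code: the function `pareto_front`, the configuration names, and all numbers are hypothetical placeholders, assuming each sampler configuration is summarized by a cost (for example, number of sampling steps) and an error-like score (lower is better).

```python
def pareto_front(points):
    """Return the configurations that no other configuration dominates.

    Each point is (name, cost, error); lower is better for both values.
    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, cost, err in points:
        dominated = any(
            (c <= cost and e <= err) and (c < cost or e < err)
            for _, c, e in points
        )
        if not dominated:
            front.append((name, cost, err))
    # Sort by cost so the front reads as a curve from fast to slow.
    return sorted(front, key=lambda p: p[1])


# Hypothetical placeholder values, NOT results from the paper.
configs = [
    ("sampler_a_10_steps", 10, 5.0),
    ("sampler_a_50_steps", 50, 3.0),
    ("sampler_b_10_steps", 10, 4.0),
    ("sampler_b_50_steps", 50, 3.5),
    ("sampler_b_100_steps", 100, 2.9),
]

front = pareto_front(configs)
for name, cost, err in front:
    print(name, cost, err)
```

With these placeholder values, the front keeps only the configurations for which no alternative is both at least as fast and at least as accurate; the rest can be discarded when choosing a sampler for a given speed budget.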
Taxonomy
Topics: Music and Audio Processing
Methods: Sparse Evolutionary Training · Diffusion
