ETTA: Elucidating the Design Space of Text-to-Audio Models

Sang-gil Lee; Zhifeng Kong; Arushi Goel; Sungwon Kim; Rafael Valle; Bryan Catanzaro

arXiv:2412.19351·cs.SD·July 2, 2025

TL;DR

This paper systematically explores the design space of text-to-audio (TTA) models. It introduces AF-Synthetic, a large dataset of synthetic captions, compares design choices across architecture, training, and sampling, and proposes ETTA, which outperforms baselines and excels in creative audio generation.

Contribution

The paper provides a comprehensive empirical analysis of TTA model components and introduces ETTA, a new model that improves generation quality and creative capabilities.

Findings

01

ETTA outperforms baseline models on AudioCaps and MusicCaps.

02

Sampling strategies significantly impact quality and speed trade-offs.

03

Synthetic dataset AF-Synthetic enhances training and evaluation.
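
To make the quality and speed trade-off in finding 02 concrete, here is a minimal toy sketch. It does not use the paper's models or samplers, which this summary does not list. As stand-ins, it integrates a simple ODE (dx/dt = -x, an assumed toy velocity field with a known exact solution) using Euler and Heun steps, and reports the error alongside the number of function evaluations (NFE), a common proxy for inference cost.

```python
import math

def velocity(x: float, t: float) -> float:
    """Toy velocity field (assumption): dx/dt = -x. A real TTA sampler
    would call a neural network here instead."""
    return -x

def euler(x: float, n_steps: int) -> tuple[float, int]:
    """First-order Euler integration from t=0 to t=1. Returns (x_1, NFE)."""
    h = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + h * velocity(x, t)
        t += h
    return x, n_steps  # one velocity evaluation per step

def heun(x: float, n_steps: int) -> tuple[float, int]:
    """Second-order Heun integration from t=0 to t=1. Returns (x_1, NFE)."""
    h = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        k1 = velocity(x, t)
        k2 = velocity(x + h * k1, t + h)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x, 2 * n_steps  # two velocity evaluations per step

EXACT = math.exp(-1.0)  # exact solution at t=1 for x(0)=1

if __name__ == "__main__":
    for name, solver, steps in [("euler", euler, 4), ("heun", heun, 2)]:
        x1, nfe = solver(1.0, steps)
        print(f"{name}: NFE={nfe} error={abs(x1 - EXACT):.4f}")
```

At the same budget of four function evaluations, the second-order solver lands closer to the exact solution on this toy problem. Whether that ordering holds for a given TTA model is an empirical question of the kind the paper investigates.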

Abstract

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed.…
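
The Pareto curves mentioned in contribution 3 can be illustrated with a short sketch. It assumes each sampler configuration is summarized by two numbers where lower is better: an inference time and a quality error score. The `Run` type, the field names, and the example values are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple

class Run(NamedTuple):
    """One hypothetical sampler configuration and its measurements."""
    name: str
    seconds: float        # inference time, lower is better
    quality_error: float  # distance-style quality metric, lower is better

def pareto_front(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run beats on both speed and quality.

    Sweep from fastest to slowest and keep a run only if it strictly
    improves on the best quality error seen so far.
    """
    front: list[Run] = []
    best_error = float("inf")
    for run in sorted(runs, key=lambda r: (r.seconds, r.quality_error)):
        if run.quality_error < best_error:
            front.append(run)
            best_error = run.quality_error
    return front

if __name__ == "__main__":
    # Made-up measurements for illustration only.
    runs = [
        Run("a", 1.0, 5.0),
        Run("b", 2.0, 3.0),
        Run("c", 3.0, 4.0),
        Run("d", 4.0, 2.5),
        Run("e", 2.0, 3.5),
    ]
    print([r.name for r in pareto_front(runs)])
```

Configurations off the front (here `c` and `e`) are dominated: some other configuration is at least as fast and at least as good. Plotting only the front gives a curve of the best achievable quality at each speed budget.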

Code & Models

Repositories

NVIDIA/elucidated-text-to-audio

Videos

ETTA: Elucidating the Design Space of Text-to-Audio Models · SlidesLive

Taxonomy

Topics: Music and Audio Processing

Methods: Sparse Evolutionary Training · Diffusion