AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

Yuanyuan Wang; Hangting Chen; Dongchao Yang; Zhiyong Wu; Xixin Wu

arXiv:2409.12560 · eess.AS · April 1, 2025

TL;DR

AudioComposer introduces a natural language-driven framework for fine-grained audio generation, effectively controlling content and style without complex reference conditions, and surpassing existing models in quality and controllability.

Contribution

The paper presents a text-to-audio (TTA) generation framework that is conditioned only on natural language descriptions and built on flow-based diffusion transformers, together with a data simulation pipeline that supplies training data for fine-grained control.

Findings

01 Outperforms state-of-the-art TTA models in quality and controllability.

02 Uses a flow-based diffusion transformer with cross-attention for effective text integration.

03 Reduces model size while maintaining superior performance.

Abstract

Current text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders them from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and to difficulties arising from the requirement for reference frame-level conditions. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control information. To further enhance audio generative modeling, we employ flow-based diffusion transformers with a cross-attention mechanism to incorporate text descriptions effectively into audio generation processes, which can not only simultaneously consider…
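The abstract names cross-attention as the mechanism that feeds text descriptions into the audio generator. As a rough illustration of that general mechanism only (not the paper's implementation), the sketch below lets a sequence of audio latent tokens attend to a sequence of text tokens. Everything specific here is an assumption made for illustration: a single attention head, random projection matrices, toy dimensions, and the helper names `matmul`, `softmax`, `cross_attention`, and `random_matrix`.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from audio, keys/values from text."""
    q = matmul(audio_tokens, w_q)  # (n_audio, d)
    k = matmul(text_tokens, w_k)   # (n_text, d)
    v = matmul(text_tokens, w_v)   # (n_text, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose to (d, n_text)
    scores = matmul(q, k_t)               # (n_audio, n_text)
    # Scale by sqrt(d), then normalise each audio token's scores over the text tokens.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model = 4                                     # toy width, an assumption
audio_tokens = random_matrix(6, d_model, rng)   # 6 audio latent positions
text_tokens = random_matrix(3, d_model, rng)    # 3 text-description tokens
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))

out, weights = cross_attention(audio_tokens, text_tokens, w_q, w_k, w_v)
print(len(out), len(out[0]))   # one output vector per audio position
```

Each audio position ends up with a weighted mix of text-token values, which is how a text description can influence every position of the generated audio. A real model would use learned weights, multiple heads, and a trained text encoder.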

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Diffusion