Audio Generation with Multiple Conditional Diffusion Model

Zhifang Guo; Jianguo Mao; Rui Tao; Long Yan; Kazushige Ouchi; Hong Liu; Xiangdong Wang

arXiv:2308.11940 · cs.SD · December 29, 2023 · 2 cites

TL;DR

This paper introduces a conditional diffusion model for audio generation that improves controllability by adding content (timestamp) and style (pitch contour and energy contour) conditions to the text prompt, enabling fine-grained control over the temporal order, pitch, and energy of the generated audio.

Contribution

It proposes a new model that improves controllability of text-to-audio generation by integrating additional conditions with a trainable encoder and Fusion-Net, while keeping the pre-trained model frozen.

Findings

01 Achieves fine-grained control over the temporal order, pitch, and energy of generated audio.

02 Demonstrates successful controllable audio generation.

03 Provides a new dataset and evaluation metrics for controllability.

Abstract

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we…
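
To make the control conditions named in the abstract more concrete, here is a minimal sketch of what a timestamp condition and an energy contour could look like as frame-level sequences. This is an illustration only: the excerpt above does not say how the authors extract or encode these conditions, so the function names (`energy_contour`, `timestamp_to_frames`), the RMS definition of energy, and the frame parameters are all assumptions. The pitch contour is omitted because it requires a pitch tracker.

```python
import math


def energy_contour(samples, frame_len, hop_len):
    """Frame-wise RMS energy of a mono signal given as a list of floats.

    Assumption: RMS is used as the energy measure; the paper may differ.
    """
    if not samples:
        return []
    contour = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop_len):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return contour


def timestamp_to_frames(events, num_frames, frame_rate):
    """Turn (onset_sec, offset_sec) pairs into a 0/1 activity value per frame.

    Hypothetical encoding of the "content (timestamp)" condition.
    """
    activity = [0] * num_frames
    for onset, offset in events:
        first = max(0, math.floor(onset * frame_rate))
        last = min(num_frames, math.ceil(offset * frame_rate))
        for i in range(first, last):
            activity[i] = 1
    return activity


if __name__ == "__main__":
    signal = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
    print(energy_contour(signal, frame_len=4, hop_len=4))
    print(timestamp_to_frames([(0.5, 1.0)], num_frames=4, frame_rate=2.0))
```

In a setup like the one the abstract describes, sequences of this kind would be passed to the trainable control condition encoder, while the pre-trained text-to-audio model stays frozen.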

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis