Audio Generation with Multiple Conditional Diffusion Model
Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang

TL;DR
This paper introduces a conditional diffusion model for audio generation that improves controllability by adding content (timestamp) and style (pitch contour and energy contour) conditions to the text prompt, enabling fine-grained control over the temporal order, pitch, and energy of the generated audio.
Contribution
It proposes a new model that improves controllability of text-to-audio generation by integrating additional conditions with a trainable encoder and Fusion-Net, while keeping the pre-trained model frozen.
Findings
Achieves fine-grained control over audio features.
Demonstrates successful controllable audio generation.
Provides a new dataset and evaluation metrics for controllability.
Abstract
Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we…
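To make the extra conditions concrete, here is a minimal sketch of how a timestamp condition and an energy contour could be turned into frame-level sequences before being passed to a condition encoder. It is an illustration under stated assumptions, not the authors' preprocessing: the function names, the 10 frames-per-second grid, the 0/1 activity encoding, and the RMS definition of energy are all hypothetical choices. The pitch contour is left out because it would need a pitch tracker.

```python
import math


def timestamps_to_frames(events, classes, duration_s, fps=10):
    """Rasterize (label, onset_s, offset_s) events into a per-class 0/1 frame grid.

    Assumed encoding: one row per sound class, one column per frame,
    1 where the class is active. The frame rate `fps` is a hypothetical choice.
    """
    n_frames = round(duration_s * fps)
    grid = {c: [0] * n_frames for c in classes}
    for label, onset, offset in events:
        # Snap onset and offset to the nearest frame boundary, clipped to the clip length.
        start = max(0, round(onset * fps))
        end = min(n_frames, round(offset * fps))
        for t in range(start, end):
            grid[label][t] = 1
    return grid


def energy_contour(samples, frame_len):
    """Frame-wise RMS energy over non-overlapping frames (assumed definition)."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


if __name__ == "__main__":
    # A dog bark from 0.2 s to 0.5 s in a 1-second clip.
    grid = timestamps_to_frames(
        [("dog_bark", 0.2, 0.5)], ["dog_bark", "speech"], duration_s=1.0
    )
    print(grid["dog_bark"])
    # Toy waveform: one loud frame followed by one silent frame.
    print(energy_contour([1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0], frame_len=4))
```

Frame-aligned sequences like these are one plausible input format for a trainable condition encoder, since they share a time axis with the audio representation being generated.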
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
