Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

TL;DR
Make-An-Audio introduces a prompt-enhanced diffusion model for text-to-audio generation, overcoming data scarcity and modeling challenges, achieving state-of-the-art results and enabling controllable, modality-agnostic audio synthesis.
Contribution
The paper presents a prompt-enhanced diffusion approach that combines pseudo prompt enhancement with a spectrogram autoencoder, advancing text-to-audio generation and extending it to other input modalities.
Findings
Achieves state-of-the-art objective and subjective performance
Demonstrates controllability and generalization across modalities
Enables high-fidelity audio generation from various inputs
Abstract
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio, a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity with orders of magnitude more concept compositions by using language-free audio; and 2) leveraging a spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation.…
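To make the diffusion part of the abstract concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to a latent vector, such as one a spectrogram autoencoder might produce. It is not the authors' implementation: the linear beta schedule, the 1000-step horizon, the toy four-element latent, and all function names are assumptions chosen for illustration, and text conditioning (for example with CLAP embeddings) is left out entirely.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.

    Returns the noisy latent and the Gaussian noise that was mixed in;
    a denoising network would be trained to predict that noise.
    """
    scale_signal = math.sqrt(alpha_bars[t])
    scale_noise = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [scale_signal * z + scale_noise * e for z, e in zip(latent, eps)]
    return noisy, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical stand-in for an encoded spectrogram latent (flattened).
z0 = [0.5, -1.0, 0.25, 2.0]
z_t, eps = add_noise(z0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Working in a compact latent space rather than on raw waveforms keeps the sequence being noised and denoised short, which is the motivation the abstract gives for the spectrogram autoencoder.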
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
