Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Rongjie Huang; Jiawei Huang; Dongchao Yang; Yi Ren; Luping Liu; Mingze Li; Zhenhui Ye; Jinglin Liu; Xiang Yin; Zhou Zhao

arXiv:2301.12661·cs.SD·January 31, 2023·47 cites

TL;DR

Make-An-Audio introduces a prompt-enhanced diffusion model for text-to-audio generation, overcoming data scarcity and modeling challenges, achieving state-of-the-art results and enabling controllable, modality-agnostic audio synthesis.

Contribution

The paper presents a prompt-enhanced diffusion approach that combines pseudo prompt enhancement with a spectrogram autoencoder, advancing text-to-audio generation and extending it to other input modalities.

Findings

01. Achieves state-of-the-art objective and subjective performance

02. Demonstrates controllability and generalization across modalities

03. Enables high-fidelity audio generation from various inputs

Abstract

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio, a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity by using language-free audios to produce orders of magnitude more concept compositions; and 2) leveraging a spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation.…

Code & Models

Repositories

text-to-audio/make-an-audio
pytorch

Models

🤗 AIGC-Audio/Make-An-Audio-3
model · ♡ 14

Videos

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models · slideslive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion