AudioGen: Textually Guided Audio Generation

Felix Kreuk; Gabriel Synnaeve; Adam Polyak; Uriel Singer; Alexandre Défossez; Jade Copet; Devi Parikh; Yaniv Taigman; Yossi Adi

arXiv:2209.15352 · cs.SD · March 7, 2023 · 54 cites

TL;DR

AudioGen is a novel autoregressive model that generates high-fidelity audio from text descriptions by leveraging data augmentation, multi-stream modeling, and classifier-free guidance, addressing key challenges in text-to-audio synthesis.
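As a rough illustration of the classifier-free guidance mentioned above, the sketch below blends text-conditioned and unconditioned next-token log-probabilities. It shows the generic formulation of the technique, not code from the paper: the function names, the toy logits, and the `scale` value are all assumptions chosen for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def guided_log_probs(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional next-token log-probabilities.

    scale = 1.0 recovers the conditional distribution; larger values
    push the result further toward the text-conditioned prediction.
    (Hypothetical helper; the name and signature are illustrative.)
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize so the result is a distribution

# Toy next-token logits over a 4-entry vocabulary (made-up values).
cond_logits = [2.0, 0.5, 0.1, -1.0]
uncond_logits = [1.0, 1.0, 0.0, 0.0]

plain = [math.exp(lp) for lp in guided_log_probs(cond_logits, uncond_logits, 1.0)]
guided = [math.exp(lp) for lp in guided_log_probs(cond_logits, uncond_logits, 3.0)]
```

With these toy values, the token favored by the conditional model receives more probability mass under `scale = 3.0` than under `scale = 1.0`, which is the intended effect of guidance.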

Contribution

The paper introduces AudioGen, a new text-conditioned audio generation model that uses augmentation and multi-stream techniques to improve quality and scalability.

Findings

01 Outperforms baseline models on multiple metrics
02 Effective in generating audio continuations
03 Handles diverse audio types and noisy conditions

Abstract

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise and reverberation). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation…
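To make the sequence-length concern concrete, the short calculation below counts how many elements a model would have to process for one clip, first as raw samples and then after a learned encoder shortens the sequence. The clip duration, sampling rate, and downsampling factor are illustrative assumptions, not figures taken from the abstract.

```python
def num_steps(duration_s, sample_rate_hz, downsample_factor=1):
    """Length of the sequence a model must process for one clip.

    downsample_factor is a hypothetical encoder stride: how many raw
    samples are summarized by one discrete token.
    """
    samples = int(duration_s * sample_rate_hz)
    return -(-samples // downsample_factor)  # ceiling division

# Illustrative values only: a 10-second clip sampled at 16 kHz.
raw_length = num_steps(10, 16_000)          # 160000 raw samples
token_length = num_steps(10, 16_000, 32)    # 5000 tokens at an assumed stride of 32
```

Even with an assumed 32x reduction, a short clip still yields thousands of autoregressive steps, which is why sequence length is listed among the challenges.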

Code & Models

Repositories

facebookresearch/audiocraft
pytorch

Videos

AudioGen: Textually Guided Audio Generation · slideslive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis