AudioGen: Textually Guided Audio Generation
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

TL;DR
AudioGen is a novel autoregressive model that generates high-fidelity audio from text descriptions by leveraging data augmentation, multi-stream modeling, and classifier-free guidance, addressing key challenges in text-to-audio synthesis.
Contribution
The paper introduces AudioGen, a new text-conditioned audio generation model that uses augmentation and multi-stream techniques to improve quality and scalability.
Findings
Outperforms baseline models on multiple metrics
Effective in generating audio continuations
Handles diverse audio types and noisy conditions
Abstract
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation…
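The TL;DR lists classifier-free guidance among the techniques AudioGen relies on. As a rough illustration of the general idea only, and not the authors' implementation, guided logits at sampling time are commonly formed by extrapolating from an unconditional prediction toward the text-conditioned one. The function names, the guidance scale, and the toy logits below are all assumptions made for this sketch.

```python
import math


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Generic classifier-free guidance blend (illustrative, hypothetical helper).

    guided = uncond + guidance_scale * (cond - uncond)
    A scale of 1.0 recovers the conditional logits; larger values push
    the distribution further toward what the text condition prefers.
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token audio vocabulary (made-up values).
cond = [2.0, 0.5, -1.0]    # model output given the text caption
uncond = [1.0, 1.0, 1.0]   # model output with the caption dropped

guided = cfg_logits(cond, uncond, guidance_scale=3.0)
probs = softmax(guided)    # distribution the next token would be sampled from
```

In this toy case the guided distribution puts more mass on the first token than the plain conditional distribution does, which is the intended effect of a guidance scale above 1.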
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
