Audiobox: Unified Audio Generation with Natural Language Prompts

Apoorv Vyas; Bowen Shi; Matthew Le; Andros Tjandra; Yi-Chiao Wu; Baishan Guo; Jiemin Zhang; Xinyue Zhang; Robert Adkins; William Ngan; Jeff Wang; Ivan Cruz; Bapi Akula; Akinniyi Akinyemi; Brian Ellis; Rashel Moritz; Yael Yungster; Alice Rakotoarison; Liang Tan; Chris Summers; Carleigh Wood; Joshua Lane; Mary Williamson; Wei-Ning Hsu

arXiv:2312.15821·cs.SD·December 27, 2023·5 cites

Open Access

TL;DR

Audiobox is a flow-matching-based model that generates multiple audio modalities (speech, sound, and music) from natural language prompts, giving controllable generation, state-of-the-art results, and synthesis of novel styles.

Contribution

The paper introduces Audiobox, a unified audio generation model that combines description-based and example-based prompting for controllability, and incorporates self-supervised pretraining and faster solvers to improve generalization and generation speed.

Findings

01

Sets new benchmarks on speech and sound generation tasks.

02

Generates over 25x faster with Bespoke Solvers than with the default ODE solver for flow matching.

03

Enables controllable synthesis of diverse audio styles.

Abstract

Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to…
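As background for the flow-matching approach named in the abstract: such models typically draw a sample by integrating a velocity field from noise at t = 0 to data at t = 1 with an ODE solver, so the number of solver steps drives generation cost. The following is a minimal sketch under stated assumptions, not Audiobox's implementation. `toy_velocity` and `euler_sample` are hypothetical names, and the closed-form velocity field (straight-line paths toward a single fixed target vector) stands in for the neural network a real model would learn.

```python
import random


def toy_velocity(x, t, target):
    # Hypothetical stand-in for a learned network v_theta(x, t).
    # For straight-line paths x_t = (1 - t) * x_0 + t * target, the
    # velocity that carries x to the target is (target - x) / (1 - t).
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, target)]


def euler_sample(velocity, x0, target, num_steps):
    # Fixed-step Euler integration of dx/dt = velocity(x, t) from t=0 to t=1.
    # Each step costs one velocity evaluation, so fewer steps is cheaper.
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # t never reaches 1, so (1 - t) stays positive
        v = velocity(x, t, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.5, -1.0, 2.0]                       # toy "data" point
noise = [rng.gauss(0.0, 1.0) for _ in target]   # starting sample at t=0
sample = euler_sample(toy_velocity, noise, target, num_steps=8)
```

In this toy case the trajectories are exactly straight, so a handful of Euler steps already lands on the target. A learned velocity field is generally curved, which is why the choice of solver and step count matters for both quality and speed.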

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis

Methods: Attention with Linear Biases