Audio Conditioning for Music Generation via Discrete Bottleneck Features

Simon Rouard; Yossi Adi; Jade Copet; Axel Roebel; Alexandre Défossez

arXiv:2407.12563 · cs.SD · July 31, 2024

TL;DR

This paper introduces a novel approach to music generation by conditioning language models with audio input through two strategies: textual inversion and joint training with audio features, validated by human and automatic evaluations.

Contribution

It proposes two methods for audio conditioning in music generation (textual inversion and joint training with a quantized audio feature extractor), together with a double classifier-free guidance technique for balancing textual and audio inputs at inference time.

Findings

01. Successful audio conditioning of music generation models.

02. Effective mixing of textual and audio conditioning.

03. Validated quality through human and automatic evaluations.

Abstract

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code and we provide music samples on https://musicgenstyle.github.io in order to show the quality of our model.
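The abstract does not spell out the double classifier-free guidance rule, so the following is only a minimal sketch of one plausible form, not the authors' stated implementation. It assumes three next-token logit vectors are available at each decoding step (unconditional, audio-only, and text-plus-audio), and that two coefficients, here called `alpha` and `beta` (hypothetical names), control the overall guidance strength and the extra weight given to the text condition.

```python
from typing import List, Sequence


def double_cfg(
    l_uncond: Sequence[float],
    l_audio: Sequence[float],
    l_text_audio: Sequence[float],
    alpha: float,
    beta: float,
) -> List[float]:
    """Combine three logit vectors with two guidance coefficients.

    ASSUMED form (not given in the abstract):
        l = l_uncond + alpha * (l_audio + beta * (l_text_audio - l_audio) - l_uncond)

    alpha scales the whole conditional direction, as in ordinary
    classifier-free guidance; beta additionally pushes toward the text
    condition relative to the audio-only prediction.
    """
    if not (len(l_uncond) == len(l_audio) == len(l_text_audio)):
        raise ValueError("all logit vectors must have the same length")
    return [
        u + alpha * (a + beta * (ta - a) - u)
        for u, a, ta in zip(l_uncond, l_audio, l_text_audio)
    ]


# Toy logits over a 2-token vocabulary (illustrative values only).
uncond = [0.0, 0.0]
audio = [1.0, -1.0]
text_audio = [2.0, 0.0]

# With beta = 1 the audio-only term cancels and the rule reduces to
# ordinary classifier-free guidance on the text-plus-audio logits.
standard = double_cfg(uncond, audio, text_audio, alpha=2.0, beta=1.0)

# With beta > 1 the text condition is emphasized over the audio condition.
text_heavy = double_cfg(uncond, audio, text_audio, alpha=2.0, beta=3.0)
```

Under this assumed form, setting `beta` to 1 recovers single-coefficient guidance, which makes the second coefficient easy to reason about as a text-versus-audio balance knob.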

Code & Models

Models

🤗 facebook/musicgen-style (model)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies