Simple and Controllable Music Generation
Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez

TL;DR
MusicGen is a novel single-transformer model for conditional music generation that produces high-quality mono and stereo samples controlled by text or melodic features, outperforming baselines.
Contribution
Introduces MusicGen, a single-stage transformer model with efficient token interleaving for controllable music generation, simplifying previous multi-model approaches.
Findings
MusicGen outperforms baselines on a standard text-to-music benchmark.
The model generates high-quality mono and stereo music conditioned on text or melodic features.
Ablation studies highlight the importance of each component in MusicGen.
Abstract
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the…
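To make the idea of "token interleaving patterns" concrete, here is a minimal sketch, not the authors' implementation. It assumes the audio tokenizer emits K parallel streams of T tokens each and compares two ways of arranging them for a single autoregressive model: full flattening (K·T decoding steps) and a "delay"-style pattern in which stream k is shifted right by k steps (T + K - 1 steps). The function names and the `PAD` placeholder are hypothetical and chosen only for illustration.

```python
PAD = -1  # hypothetical placeholder for positions that hold no token


def flatten_interleave(codes):
    """Serialize K streams of length T into one sequence of K*T tokens.

    Tokens are ordered time step by time step: all K streams at t=0,
    then all K streams at t=1, and so on.
    """
    num_streams = len(codes)
    num_steps = len(codes[0])
    return [codes[k][t] for t in range(num_steps) for k in range(num_streams)]


def delay_interleave(codes, pad=PAD):
    """Shift stream k right by k positions, padding the gaps.

    The result has K rows of length T + K - 1, so a model that predicts
    one column (K tokens) per step needs T + K - 1 steps instead of K*T.
    """
    num_streams = len(codes)
    num_steps = len(codes[0])
    delayed = []
    for k, stream in enumerate(codes):
        if len(stream) != num_steps:
            raise ValueError("all streams must have the same length")
        delayed.append([pad] * k + list(stream) + [pad] * (num_streams - 1 - k))
    return delayed


def undo_delay(delayed):
    """Invert delay_interleave and recover the original K x T layout."""
    num_streams = len(delayed)
    num_steps = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6]]  # K=2 streams, T=3 time steps
    print(flatten_interleave(codes))  # 6 sequential steps
    print(delay_interleave(codes))    # 4 steps, 2 tokens per step
```

The trade-off this illustrates: flattening keeps every token conditioned on all earlier ones but multiplies sequence length by K, while the shifted layout keeps the sequence close to T steps at the cost of predicting several streams in parallel at each step.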
Code & Models
- 🤗 WhisperSpeech/WhisperSpeech (model) · ♡ 250
- 🤗 facebook/musicgen-stereo-large (model) · 5.6k dl · ♡ 92
- 🤗 facebook/musicgen-melody (model) · 4.3k dl · ♡ 251
- 🤗 facebook/musicgen-small (model) · 111k dl · ♡ 480
- 🤗 facebook/musicgen-medium (model) · 1.4M dl · ♡ 158
- 🤗 facebook/musicgen-large (model) · 16k dl · ♡ 525
- 🤗 facebook/encodec_32khz (model) · 113k dl · ♡ 19
- 🤗 facebook/audiogen-medium (model) · 24k dl · ♡ 141
- 🤗 reach-vb/musicgen-large-endpoint (model) · 16 dl · ♡ 1
- 🤗 reach-vb/musicgen-small-endpoint (model) · 4 dl · ♡ 1
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
