MusicLM: Generating Music From Text

Andrea Agostinelli; Timo I. Denk; Zalán Borsos; Jesse Engel; Mauro Verzetti; Antoine Caillon; Qingqing Huang; Aren Jansen; Adam Roberts; Marco Tagliasacchi; Matt Sharifi; Neil Zeghidour; Christian Frank

arXiv:2301.11325 · cs.SD · January 27, 2023 · 182 cites

TL;DR

MusicLM is a hierarchical sequence-to-sequence model that generates 24 kHz music from text descriptions, stays consistent over several minutes, outperforms previous systems in audio quality and text adherence, and can also be conditioned on a whistled or hummed melody.

Contribution

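The paper introduces MusicLM, a hierarchical sequence-to-sequence model for text-to-music generation that produces high-fidelity, consistent music and supports conditioning on both text and melodies. It also releases MusicCaps, a dataset of 5.5k music-text pairs with descriptions written by human experts.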

Findings

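01 MusicLM outperforms previous systems in both audio quality and adherence to the text description.

02 MusicLM generates music at 24 kHz that remains consistent over several minutes.

03 MusicLM can transform whistled and hummed melodies according to the style described in a text caption.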

Abstract

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
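To give a sense of scale for "24 kHz" audio that stays consistent "over several minutes", the short sketch below counts raw samples in a clip and compares that with the length of a lower-rate token sequence. The 3-minute duration and the 50 Hz frame rate are illustrative assumptions, not figures taken from the abstract, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate stated in the abstract


def num_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples in a clip of the given duration."""
    return int(round(duration_s * sample_rate_hz))


def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of steps in a sequence that advances at frame_rate_hz."""
    return int(round(duration_s * frame_rate_hz))


# Assumption: "several minutes" is illustrated with a 3-minute clip.
clip_s = 3 * 60

# Assumption: 50 Hz is a hypothetical frame rate chosen only for comparison.
hypothetical_frame_rate_hz = 50

print(num_samples(clip_s))                             # 4320000
print(num_frames(clip_s, hypothetical_frame_rate_hz))  # 9000
```

Under these assumptions, a 3-minute clip holds about 4.3 million samples but only 9,000 steps at the hypothetical frame rate. This difference in sequence length is one reason a model might work on compressed, hierarchical representations rather than on raw samples.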

Code & Models

Models

google/magenta-realtime (model · 261 downloads · 545 likes)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis

Methods: Adam · 1-bit Adam