MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

TL;DR
MusicLM is a hierarchical model that generates high-quality, long-duration music from text descriptions, outperforming previous systems and supporting multi-modal conditioning with melodies.
Contribution
Introducing MusicLM, a novel hierarchical sequence-to-sequence model for text-to-music generation that produces high-fidelity, consistent music and supports conditioning on both text and melodies.
Findings
MusicLM outperforms previous systems in both audio quality and adherence to the text description.
MusicLM maintains consistency over several minutes.
MusicLM can transform whistled and hummed melodies according to the style described in a text caption.
Abstract
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
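To give a sense of scale for the claim of 24 kHz audio that stays consistent over several minutes, the sketch below counts the raw waveform samples in a mono clip. The function name `num_samples` and the example durations are assumptions chosen for illustration; only the 24 kHz sample rate comes from the abstract.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate stated in the abstract


def num_samples(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return the number of raw waveform samples in a mono clip."""
    return int(round(duration_seconds * sample_rate_hz))


if __name__ == "__main__":
    # Illustrative durations only; the abstract says "several minutes".
    for minutes in (1, 3, 5):
        print(f"{minutes} min -> {num_samples(minutes * 60):,} samples")
```

At this rate a few minutes of audio already runs to millions of samples, which is one plausible reason to treat generation as a hierarchical sequence modeling task rather than predicting the waveform sample by sample.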
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
Methods: Adam · 1-bit Adam
