Mustango: Toward Controllable Text-to-Music Generation
Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, Soujanya Poria

TL;DR
Mustango is a diffusion-based text-to-music system that enables detailed control over generated music using rich, music-specific prompts, supported by a novel data augmentation method and a music-domain-knowledge-informed guidance module.
Contribution
The paper introduces Mustango, a new controllable text-to-music generation system with a music-domain-guided diffusion model and a large, augmented dataset for training.
Findings
Mustango achieves state-of-the-art music quality.
Controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.
The MusicBench dataset contains over 52,000 annotated music instances.
Abstract
The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music, not only with general text captions, but with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation…
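To make the idea of a "rich caption" concrete, here is a minimal sketch of how a general caption might be extended with instructions for chords, beats, tempo, and key, the four control types named in the abstract. The `MusicControls` class, the `build_rich_caption` helper, and the sentence templates are all hypothetical and chosen for illustration; they are not taken from the Mustango code or from the MusicBench caption format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MusicControls:
    """Hypothetical container for the four control types named in the abstract."""
    chords: Optional[Sequence[str]] = None
    beats_per_bar: Optional[int] = None
    tempo_bpm: Optional[int] = None
    key: Optional[str] = None


def build_rich_caption(general_caption: str, controls: MusicControls) -> str:
    """Append music-specific control sentences to a general caption.

    The sentence templates below are illustrative assumptions,
    not the wording used by the paper's dataset.
    """
    # Normalize the general caption so it ends with exactly one period.
    parts = [general_caption.strip().rstrip(".") + "."]
    if controls.chords:
        parts.append("The chord progression is " + ", ".join(controls.chords) + ".")
    if controls.beats_per_bar is not None:
        parts.append(f"The beat counts to {controls.beats_per_bar}.")
    if controls.tempo_bpm is not None:
        parts.append(f"The tempo is {controls.tempo_bpm} bpm.")
    if controls.key:
        parts.append(f"The key is {controls.key}.")
    return " ".join(parts)


caption = build_rich_caption(
    "A mellow piano ballad with soft strings",
    MusicControls(chords=["Am", "F", "C", "G"], beats_per_bar=4, tempo_bpm=72, key="A minor"),
)
print(caption)
```

Controls that are left unset are simply omitted, so the same helper covers both plain captions and fully specified ones.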
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Digital Humanities and Scholarship
Methods: Diffusion
