Mustango: Toward Controllable Text-to-Music Generation

Jan Melechovsky; Zixun Guo; Deepanway Ghosal; Navonil Majumder; Dorien Herremans; Soujanya Poria

arXiv:2311.08355 · eess.AS · June 18, 2025 · ACL · 1 citation

Open Access · 4 Repos · 3 Models · 1 Dataset · 1 Video

TL;DR

Mustango is a diffusion-based text-to-music system that enables detailed control over generated music using rich, music-specific prompts, supported by a novel data augmentation method and a music-domain-knowledge-informed guidance module.

Contribution

The paper introduces Mustango, a new controllable text-to-music generation system with a music-domain-guided diffusion model and a large, augmented dataset for training.

Findings

01 Mustango achieves state-of-the-art quality in generated music.

02 Its controllability through music-specific text prompts outperforms existing models.

03 The MusicBench dataset contains over 52,000 annotated music instances.

Abstract

The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music not only with general text captions, but also with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation…
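
To make the idea of captions that carry chord, beat, tempo, and key instructions concrete, here is a minimal Python sketch that appends one control sentence per musical attribute to a general caption. The `MusicControls` class, the `build_prompt` function, and the sentence templates are all hypothetical illustrations; they are not taken from the paper or its code, and the actual wording used by Mustango may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MusicControls:
    """Hypothetical container for the four control types named in the abstract."""
    chords: Optional[List[str]] = None  # e.g. ["Am", "F", "C", "G"]
    beats: Optional[int] = None         # beats per bar
    tempo_bpm: Optional[int] = None     # tempo in beats per minute
    key: Optional[str] = None           # e.g. "A minor"


def build_prompt(caption: str, controls: MusicControls) -> str:
    """Append one plain-English sentence for each control that is set.

    The sentence templates are illustrative assumptions, not the paper's wording.
    """
    parts = [caption.strip()]
    if controls.chords:
        parts.append("The chord progression is " + ", ".join(controls.chords) + ".")
    if controls.beats is not None:
        parts.append(f"The beat counts to {controls.beats}.")
    if controls.tempo_bpm is not None:
        parts.append(f"The tempo is {controls.tempo_bpm} bpm.")
    if controls.key:
        parts.append(f"The key is {controls.key}.")
    return " ".join(parts)


prompt = build_prompt(
    "A calm piano piece with soft strings.",
    MusicControls(chords=["Am", "F", "C", "G"], beats=4, tempo_bpm=80, key="A minor"),
)
print(prompt)
```

Controls left unset are simply omitted, so a plain caption passes through unchanged; this mirrors the abstract's point that the system accepts both general and music-specific captions.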

Code & Models

Datasets

amaai-lab/MusicBench
dataset · 554 downloads

Videos

Mustango: Toward Controllable Text-to-Music Generation

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Digital Humanities and Scholarship

Methods: Diffusion