Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Yixiao Zhang; Yukara Ikemiya; Woosung Choi; Naoki Murata; Marco A. Mart\'inez-Ram\'irez; Liwei Lin; Gus Xia; Wei-Hsiang Liao; Yuki Mitsufuji; Simon Dixon

arXiv:2405.18386·cs.SD·July 21, 2025

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Mart\'inez-Ram\'irez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

PDF

Open Access 2 Repos

TL;DR

Instruct-MusicGen is a fine-tuned music language model that efficiently follows editing instructions, enabling high-quality, flexible text-to-music editing with minimal additional parameters and training.

Contribution

The paper introduces a novel instruction tuning approach for MusicGen, incorporating text and audio fusion modules, significantly improving editing capabilities with minimal parameter increase.

Findings

01

Achieves superior editing performance compared to baselines.

02

Only 8% new parameters and 5K training steps needed.

03

Performs comparably to task-specific models.

Abstract

Recent advances in text-to-music editing, which employ text queries to modify music (e.g.\ by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To Combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMusic and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis