Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang

TL;DR
This paper introduces a novel Diffusion Transformer model with ControlNet for improved controllable music generation and editing, utilizing a new top-k constant-Q Transform for precise melody control and a curriculum learning strategy for stability.
Contribution
It proposes a Diffusion Transformer with ControlNet for long-form, variable-length music editing, and introduces a top-k constant-Q Transform for better melody representation, advancing controllable music synthesis.
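The control-branch idea can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the block function, the sizes, and the names `frozen_block`, `control_block`, and `w_zero` are hypothetical, and a zero-initialized linear projection is assumed in place of the zero-initialized convolution used in the original image ControlNet. It shows only the general ControlNet property that a trainable copy, connected through zero-initialized weights, leaves the pretrained model's output unchanged at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8    # hypothetical hidden size
n_frames = 5   # hypothetical sequence length


def frozen_block(h, w):
    """Stand-in for one pretrained (frozen) transformer block."""
    return h + np.tanh(h @ w)


def control_block(h, c, w_copy, w_zero):
    """Trainable copy of the block that also sees the melody condition c.

    The copy's output passes through the projection w_zero, so when
    w_zero is all zeros the control branch contributes nothing.
    """
    branch = frozen_block(h + c, w_copy)
    return branch @ w_zero


w = rng.normal(size=(d_model, d_model))   # pretrained weights (kept frozen)
w_copy = w.copy()                         # trainable copy starts from them
w_zero = np.zeros((d_model, d_model))     # zero-initialized projection

h = rng.normal(size=(n_frames, d_model))  # hidden states of the backbone
c = rng.normal(size=(n_frames, d_model))  # embedded melody prompt (assumed shape)

base = frozen_block(h, w)
controlled = base + control_block(h, c, w_copy, w_zero)
```

With `w_zero` still at zero, `controlled` equals `base`; the melody condition only starts to influence the output once training moves `w_zero` away from zero.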
Findings
The proposed model outperforms the MusicGen baseline on both text-to-music generation and melody preservation.
It enables long-form, variable-length music editing controlled by text and melody prompts.
It demonstrates stronger control and higher quality in both music generation and editing.
Abstract
Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-k constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that…
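One plausible reading of the top-k constant-Q Transform representation is sketched below, under stated assumptions. The abstract does not give the exact procedure, so the function name `top_k_cqt_bins`, the choice of k, and the toy magnitudes are illustrative only. In practice the magnitude matrix would come from a CQT routine in an audio library. The sketch keeps, for each frame, the indices of the k strongest CQT bins. Because bin indices span several octaves, this keeps octave information that a chroma representation, which folds pitches into 12 pitch classes, discards.

```python
import numpy as np


def top_k_cqt_bins(cqt_mag, k):
    """Return, per frame, the indices of the k largest CQT bins.

    cqt_mag: array of shape (n_bins, n_frames) with non-negative magnitudes.
    Result: integer array of shape (k, n_frames), strongest bin first.
    """
    if not 1 <= k <= cqt_mag.shape[0]:
        raise ValueError("k must be between 1 and the number of CQT bins")
    # Sort bins by descending magnitude within each frame (column).
    order = np.argsort(-cqt_mag, axis=0, kind="stable")
    return order[:k, :]


# Toy magnitudes: 6 frequency bins x 3 time frames (made-up values).
cqt_mag = np.array([
    [0.1, 0.0, 0.2],
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.9],
    [0.0, 0.2, 0.0],
])

melody_prompt = top_k_cqt_bins(cqt_mag, k=2)
print(melody_prompt)
# [[1 2 4]
#  [4 3 3]]
```

Keeping more than one bin per frame is what would let such a prompt describe several simultaneous notes, for example in multi-track music.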
Taxonomy
Topics: Music and Audio Processing · Neural Networks and Applications · Music Technology and Sound Studies
Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
