MEDIC: Zero-shot Music Editing with Disentangled Inversion Control
Huadai Liu, Jialei Wang, Xiangtai Li, Wen Wang, Qian Chen, Rongjie Huang, Yang Liu, Jiayang Xu, Zhou Zhao

TL;DR
MEDIC introduces a zero-shot music editing system that leverages Disentangled Inversion Control to improve editing fidelity and content preservation in complex music edits guided by text prompts.
Contribution
The paper proposes Disentangled Inversion Control and Harmonized Attention Control techniques for improved zero-shot music editing.
Findings
Outperforms existing inversion methods in fidelity and content preservation.
Introduces ZoME-Bench, a comprehensive benchmark for music editing.
Demonstrates effectiveness in complex non-rigid music edits.
Abstract
Text-guided diffusion models have revolutionized audio generation by adapting source audio to specific text prompts. However, existing zero-shot audio editing methods such as DDIM inversion accumulate errors across diffusion steps, which reduces their effectiveness. Moreover, existing editing methods struggle to perform complex non-rigid music edits while maintaining content integrity and high fidelity. To address these challenges, we propose MEDIC, a novel zero-shot music editing system based on an innovative Disentangled Inversion Control (DIC) technique, which comprises Harmonized Attention Control and Disentangled Inversion. Disentangled Inversion disentangles the diffusion process into three branches to rectify the deviated path of the source branch caused by DDIM inversion. Harmonized Attention Control unifies the mutual self-attention control and the cross-attention control with an…
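The error accumulation attributed to DDIM inversion above can be illustrated with a toy round trip. The sketch below is not the MEDIC method and uses no trained model: `toy_eps`, the linear `alpha_bar` schedule, and the input values are hypothetical stand-ins chosen for illustration. It inverts a clean signal with the standard deterministic DDIM update, samples back, and measures how far the reconstruction drifts. The drift comes from the usual inversion approximation, which evaluates the noise predictor at the current state instead of the next one.

```python
import numpy as np

def toy_eps(x, t):
    # Hypothetical stand-in for a trained noise predictor (assumption, not from the paper).
    return np.tanh(x) * (0.5 + 0.5 * t)

def ddim_step(x, eps, a_from, a_to):
    # Deterministic DDIM update between two cumulative alpha values.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def round_trip_error(x0, num_steps, eps_fn=toy_eps):
    # Assumed schedule: alpha_bar decreases linearly from nearly clean to noisy.
    alphas = np.linspace(0.999, 0.05, num_steps + 1)
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    x = np.array(x0, dtype=float)
    # Inversion (clean -> noisy): eps is evaluated at the current state,
    # which only approximates the eps the sampler will see on the way back.
    for i in range(num_steps):
        x = ddim_step(x, eps_fn(x, ts[i]), alphas[i], alphas[i + 1])
    # Sampling (noisy -> clean) along the same schedule.
    for i in range(num_steps, 0, -1):
        x = ddim_step(x, eps_fn(x, ts[i]), alphas[i], alphas[i - 1])
    return float(np.mean(np.abs(x - np.asarray(x0, dtype=float))))

if __name__ == "__main__":
    x0 = np.array([0.5, -1.0, 2.0])
    for steps in (5, 20, 100):
        print(steps, round_trip_error(x0, steps))
```

If the predictor did not depend on the state (for example a constant `eps_fn`), each sampling step would undo the matching inversion step exactly and the round trip would be lossless up to floating-point error. With a state-dependent predictor the per-step mismatch compounds, which is the kind of deviated source path the abstract says Disentangled Inversion is meant to rectify.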
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Good zero-shot editing performance compared to previous SOTA. The demo page shows effective controllability of some music concepts that previous models failed to control. 2. The benchmark is very useful for future researchers on music editing. 3. The methodology of Harmonized Attention Control and the Disentangled Inversion Technique is novel and could help zero-shot editing in other domains.
1. While the experiments are about music editing, the evaluation only uses metrics for general audio editing. Music content-related metrics like chroma distance [1] are missing. 2. The paper does not seem to be clear enough. See questions. 3. The values and effects of the hyperparameters in the paper are unclear, like $k, \tau_c, L$ and $S$. Ablation study or case study by changing these hyperparameters would be helpful to understand the model. 4. While the methodology seems to be general-purpos
- The main idea of this paper—incorporating mutual self-attention, cross-attention control, and harmonic control—is sensible, even though each module is not entirely novel. The combination of these mechanisms appears effective, as results indicate that combining them enhances model performance in music editing tasks, providing useful insights. - The paper is thorough in its experimental design, including both subjective evaluations and a variety of objective experiments. The results effectively
Although this paper is a strong empirically-driven study, there are certain hypothesis-related issues that could be improved. - First, the paper needs to clarify what is meant by “rigid” and “non-rigid” tasks. These terms appear throughout the paper, but after re-reading the entire text, I still found no clear explanation of what these tasks entail, which left me quite confused. - The paper actually addresses a text-guided music audio editing task. However, the language and context in the main
To improve the music editing performance of DDIM inversion, the authors did not simply combine Cross-attention control and Mutual self-attention control; they introduced an additional Harmonic Branch to integrate these techniques. Furthermore, they proposed the Disentangled Inversion Technique. By leveraging these methods, they surpass existing music-editing methods in both objective and subjective metrics. Originality/Contribution: - Introduction of the Harmonic Branch and Disentangled Inversi
**Overall**: The following points represent the overall weaknesses in the current manuscript. Please refer to the detailed explanations in the latter part of the Weaknesses and Questions sections. 1. Insufficient or unclear validation of the effectiveness of the proposed method, which is directly related to the originality of this work. (For more details, see A. in Weaknesses and 1. in Questions.) 2. Unclear motivation for incorporating the inversion process (L3 in Algorithm 2) within the probl
- The proposed method overall seems reasonably novel and well-motivated. Much space is given to explaining the facets of their method, and the graphical comparison to existing methods like MusicMagus is much appreciated. - Ablations of the proposed method are solid and thorough, and show clear strengths of the design choices the authors made.
Overall, while the proposed method is solidly novel and seems to perform better than current SOTA training-free editing approaches, issues in the overall clarity of the paper, the evaluation suite, and in particular the proposed benchmark outweigh the contributions, and thus I recommend rejection. # Overall Clarity --- The paper contains a number of grammatical errors, incorrect names of things, and incorrect citations. - The “Branch” term (line 070) is introduced without explanation. - "rigid"
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need · Diffusion
