High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

TL;DR
MelodyFlow is a novel diffusion-based model that enables high-fidelity, text-guided music editing directly on continuous latent representations, outperforming previous methods in quality and versatility.
Contribution
It introduces a flow-matching trained diffusion transformer for efficient, high-quality music editing with simple text prompts, and adapts latent inversion methods for improved zero-shot editing.
Findings
Latent inversion with flow matching outperforms ReNoise and DDIM.
Subjective evaluations show substantial improvement over previous methods.
The model achieves high-fidelity editing at 48 kHz stereo with variable durations.
Abstract
We introduce MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model. It operates on continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec. Based on a diffusion transformer architecture trained on a flow-matching objective, the model can edit diverse high-quality stereo samples of variable duration with simple text descriptions. We adapt the ReNoise latent inversion method to flow matching and compare it with the original implementation and naive denoising diffusion implicit model (DDIM) inversion on a variety of music editing prompts. Our results indicate that our latent inversion outperforms both ReNoise and DDIM for zero-shot test-time text-guided editing on several objective metrics. Subjective evaluations exhibit a substantial improvement over the previous state of the art for music editing. Code and…
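As a rough illustration of the two ideas the abstract relies on (a flow-matching training target and ODE-based latent inversion), the following is a minimal NumPy sketch, not the authors' implementation. It assumes a linear interpolation path with noise at t = 0 and data at t = 1, plain fixed-step Euler integration, and no ReNoise-style regularization. The names `fm_training_pair`, `fm_loss`, `euler_integrate`, and `toy_velocity` are hypothetical, and `toy_velocity` is only a stand-in for a trained, text-conditioned network.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Linear-path flow matching (assumed convention: x0 is noise, x1 is data).

    Returns the interpolated point x_t and the velocity the model should regress.
    """
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return xt, target

def fm_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))

def euler_integrate(velocity_fn, x, t_start, t_end, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t_start to t_end with fixed-step Euler."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x

def toy_velocity(x, t):
    # Hypothetical stand-in for a trained model: pulls every coordinate toward 0.5.
    return 0.5 - x

# Inversion: run the ODE backward from data (t = 1) to a latent (t = 0),
# then forward again to check how well the input is reconstructed.
data = np.array([1.0, -2.0, 0.3])
latent = euler_integrate(toy_velocity, data, 1.0, 0.0, steps=1000)
recon = euler_integrate(toy_velocity, latent, 0.0, 1.0, steps=1000)
recon_error = float(np.max(np.abs(recon - data)))
```

In an editing setting, the backward pass would typically be conditioned on a description of the source audio and the forward pass on the editing prompt. Plain Euler inversion is only approximately reversible, which is the kind of error that inversion methods such as ReNoise aim to reduce.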
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The authors demonstrated good music generation results. 2. The authors promised the release of the code and model checkpoints, which can be useful for the community. 3. The paper is well written.
- Questionable quantitative results: In Table 1, the proposed method shows marginal improvement in relevance to the editing prompt, but it falls behind in terms of consistency. Additionally, better overall sound quality does not necessarily indicate improved editing capability. Overall, the metrics presented regarding the editing are not convincing. - Poor qualitative results: The provided samples in the supp. do not demonstrate fine-grained editing capabilities, as they fail to showcase cas
- Music generation: The authors experimentally demonstrate that their FM-based text-to-music generative model can produce 48kHz, stereo-format waveforms. To achieve this, as discussed in Section 2.5, Appendix A.2, and Table 4, they conduct a detailed exploration of architectural improvements, including audio compression parts and enhanced training techniques such as Minibatch Coupling. - Music editing: They reformulate techniques of improved DDIM inversion, as proposed in Pix2pix-zero [1] and Re
**Overall**: 1. Music generation part: - The applicability of FM-based generative models for audio data is already explored in some prior work (for example, Text-to-Music Generation [1] and Text-to-Audio [2]). Therefore, simply applying an FM-based model to the text-to-music generation task (even 'single-stage') does not contribute significant new insights for the audio/music community. - Therefore, additional novel technical contributions, such as sophisticated ideas to surpass the sample qu
- This work adapts ReNoise with flow matching to the music modality, plus its own modification, which demonstrates the feasibility of such a methodology for music editing. The objective and subjective results also show that the proposed model has better performance on editing tasks compared to other baselines. - In terms of generation, the proposed model is capable of generating samples whose quality is on par with baseline models at a much shorter inference latency. - The proposed model and metho
- Components of this work are a combination of existing works. Its novelty lies in the modification to the regularization method of ReNoise, but it lacks theoretical support/analysis of the effectiveness of such a modification. It would be worth having more discussion and analysis on, for example, why removing L_pair improves the result, whether there is a sweet spot for lambda_pair, and what the side effects of removing L_pair are. Empirically, it might be helpful to have a grid search based on
- Overall writing style is clear. - Introduction of the FM objective is streamlined and easy to follow. - The breadth of ablations is much appreciated, as the authors go to reasonable lengths to understand the design space and limitations of their method.
- Line (046-047) “editing methods from the computer vision domain, which are exclusive to diffusion models (Novack et al., 2024; Zhang et al., 2024; Manor & Michaeli, 2024)” is not wholly true. Optimization methods like Novack et al., 2024 are agnostic to the sampling process (and in fact, the flow-matching equivalent has already been explored [1]), and guidance methods like Zhang et al., 2024 are also agnostic to sampler (as nothing specific about their method *requires* DDIM inversion as the i
1. This paper introduces a novel single-stage flow matching model for text-to-music generation, capable of generating and editing audio samples at 48 kHz stereo quality. 2. The authors exploit a regularized flow matching inversion method to facilitate text-based music editing and conduct ablation studies to validate its effectiveness. 3. The experimental results demonstrate that the proposed approach outperforms all baseline methods across all objective metrics.
1. Although the paper claims to introduce the flow matching model for text-to-music generation, it is apparent that its performance on subjective metrics does not match that of Stable Audio. Furthermore, the authors have not provided an evaluation of Stable Audio's objective metrics. 2. Need a more comprehensive literature review. Important editing methods like DITTO or MEDIC are not in the literature review. Also need to compare with MusicMagus in zero-shot editing. 3. This work proposes a ne
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
Methods: Diffusion
