AudioMorphix: Training-free audio editing with diffusion probabilistic models
Jinhua Liang, Yuanzhe Chen, Yi Yuan, Dongya Jia, Xiaobin Zhuang, Zhuo Chen, Yuping Wang, Yuxuan Wang

TL;DR
AudioMorphix is a training-free, diffusion-based audio editing method that enables precise, localized modifications to spectrograms by referencing other recordings, preserving audio fidelity and supporting diverse editing tasks.
Contribution
It introduces a novel, training-free approach for localized audio editing using diffusion models and morphing theory, with an enhanced self-attention mechanism and a new evaluation benchmark.
Findings
Achieves high fidelity in various editing tasks
Enables precise modifications while preserving original audio quality
Demonstrates promising results across multiple editing scenarios
Abstract
Editing sound with precision is a crucial yet underexplored challenge in audio content creation. While existing works can manipulate sounds by text instructions or audio exemplar pairs, they often struggle to modify audio content precisely while preserving fidelity to the original recording. In this work, we introduce a novel editing approach that enables localized modifications to specific time-frequency regions while keeping the rest of the audio intact by operating on spectrograms directly. To achieve this, we propose AudioMorphix, a training-free audio editor that manipulates a target region on the spectrogram by referring to another recording. Inspired by morphing theory, we conceptualize audio mixing as a process where different sounds blend seamlessly through morphing and can be decomposed back into individual components via demorphing. Our AudioMorphix optimizes the noised…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
– A training-free approach to audio editing is interesting. Done well, it could lead to interesting directions for generative methods for audio editing. – A new dataset is also created. If the dataset is released publicly, it would be useful for the community.
– The paper is hard to follow. It’s not clear how the overall framework comes together. An overall system diagram showing what is going on would be helpful. How are all the pieces in the model connected? – How will the proposed method handle editing that combines addition, removal, and replacement for a given clip? Some experimental results on this are also expected. Moreover, do we really need something extensive to do addition? Why can't we simply add the signals whil
1. This paper explores an interesting topic, which is reference audio-based audio editing. 2. By utilizing a series of energy functions, the method performs better than DDIM inversion. 3. This work introduces a novel test set for audio editing.
1. [1] and [2] are mentioned in this paper. Zhang's work uses cross-attention control for music editing, yet this paper does not compare against it in either methodology or experiments. Considering the similarity between these two works, this should be regarded as a weakness of this paper. Additionally, [3] and [4] are related to this work; the authors should include them in the related works. 2. Subjective evaluation is missing. It is always necessary when the audio generation model is prop
A new dataset (or benchmark?) for audio editing.
The paper is filled with approximations and expressions that make little sense. The paper refers a lot to previous methods and it is not always clear what is novel and what is inspired from prior work. Some core concepts are not defined (e.g. "energy functions").
- The proposed energy functions for addition and removal are novel - Good paper survey, and rigorous ad hoc trials are done to get nice results in objective metrics
- The novelties of this work are not clearly stated. I believe tangent proj. and the memory bank are not novel, while the proposed energy functions and the use of SLERP are the original contributions of this work. I recommend listing your original contributions in the Intro - The effectiveness of SLERP is not verified. The method should be compared to LERP to confirm your assumption - The audio samples on the given webpage link for "removal" are not impressive at all, in comparison with the ones on th
1. Technical Depth: The authors delve into the technical details of how AudioMorphix works, including the use of spherical linear interpolation for latent state interpolation and the design of energy functions to guide the diffusion process. 2. New Evaluation Dataset: The creation of AudioSet-E, a new dataset for evaluating audio editing methods, is a valuable contribution to the research community. It provides a standardized way to assess the performance of different audio editing techniques.
1. Confusing charts and tables. For example, Figure 1 is difficult to read. The layout between the figures is messy. It takes a lot of time to see the relationship between the audio and the processing method. There are still some questions. For example: In Style Transfer, is audio or text needed as a reference signal? The first example of Audio Removal will introduce new content in the edited audio. Is it in line with expectations? 2. Insufficient references and experimental comparisons. There a
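Several reviews above refer to spherical linear interpolation (SLERP) of latent states and ask for a comparison with plain linear interpolation (LERP). The following is a minimal NumPy sketch of the two interpolation rules in general, not the authors' implementation. The function names `lerp` and `slerp`, the `eps` threshold, and the toy vectors are assumptions made for illustration only.

```python
import numpy as np


def lerp(a, b, t):
    """Linear interpolation between arrays a and b at fraction t in [0, 1]."""
    return (1.0 - t) * a + t * b


def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation between arrays a and b.

    The angle between a and b is measured on the flattened arrays, so the
    same code works for vectors and for latent tensors of any shape.
    """
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_omega = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    )
    # Guard against rounding pushing the cosine slightly outside [-1, 1].
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or antiparallel) inputs: fall back to LERP
        # to avoid dividing by a value close to zero.
        return lerp(a, b, t)
    w_a = np.sin((1.0 - t) * omega) / sin_omega
    w_b = np.sin(t * omega) / sin_omega
    return w_a * a + w_b * b


# Toy example (hypothetical data): two orthogonal unit vectors.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid_slerp = slerp(a, b, 0.5)
mid_lerp = lerp(a, b, 0.5)
print(np.linalg.norm(mid_slerp), np.linalg.norm(mid_lerp))
```

For inputs of equal norm, SLERP keeps the norm of the interpolated point constant, whereas LERP shrinks it toward the midpoint. This is the usual argument for preferring SLERP on Gaussian-like diffusion latents; whether it matters for this method is exactly what the reviewer's requested LERP comparison would test.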
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
