SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model
Xinlei Niu, Jing Zhang, Charles Patrick Martin

TL;DR
SoundMorpher introduces a perceptually-uniform sound morphing method using diffusion models and perceptual metrics, enabling high-quality sound transitions for creative and industrial applications.
Contribution
It proposes a novel sound morphing approach that explicitly models perceptual differences and introduces new evaluation metrics for sound morphing quality.
Findings
Effective in generating perceptually uniform sound transitions
Outperforms traditional linear interpolation methods
Versatile for real-world audio applications
Abstract
We present SoundMorpher, an open-world sound morphing method designed to generate perceptually uniform morphing trajectories. Traditional sound morphing techniques typically assume a linear relationship between the morphing factor and sound perception, achieving smooth transitions by linearly interpolating the semantic features of source and target sounds while gradually adjusting the morphing factor. However, these methods oversimplify the complexities of sound perception, resulting in limitations in morphing quality. In contrast, SoundMorpher explores an explicit relationship between the morphing factor and the perception of morphed sounds, leveraging log Mel-spectrogram features. This approach further refines the morphing sequence by ensuring a constant target perceptual difference for each transition and determining the corresponding morphing factors using binary search. To address…
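The binary-search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. `toy_morph` is a hypothetical stand-in for the diffusion-based morph, plain L2 distance on a small vector stands in for the distance on log Mel-spectrogram features, and setting the per-step target to the endpoint distance divided by the number of steps is an assumption that only holds because the toy path is a straight line. The sketch also assumes the distance from the previous output grows monotonically with the morphing factor.

```python
import numpy as np


def toy_morph(alpha, src, tgt):
    """Hypothetical morph: output depends non-linearly (cubically) on alpha."""
    w = alpha ** 3
    return (1.0 - w) * src + w * tgt


def l2(a, b):
    """Stand-in perceptual distance: L2 between feature vectors."""
    return float(np.linalg.norm(a - b))


def next_alpha(morph, dist, a_prev, delta, tol=1e-6, max_iter=60):
    """Binary-search alpha in [a_prev, 1] whose output lies `delta` from morph(a_prev)."""
    ref = morph(a_prev)
    lo, hi = a_prev, 1.0
    if dist(ref, morph(hi)) <= delta:
        return 1.0  # target is already within one step
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if dist(ref, morph(mid)) < delta:
            lo = mid  # step too small: move right
        else:
            hi = mid  # step too large: move left
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def uniform_alphas(morph, dist, n_steps):
    """Morphing factors whose consecutive outputs are (roughly) equally far apart."""
    # Assumption: per-step target = endpoint distance / n_steps (valid for this toy path).
    delta = dist(morph(0.0), morph(1.0)) / n_steps
    alphas = [0.0]
    for _ in range(n_steps - 1):
        alphas.append(next_alpha(morph, dist, alphas[-1], delta))
    alphas.append(1.0)
    return alphas


src, tgt = np.zeros(8), np.ones(8)
morph = lambda a: toy_morph(a, src, tgt)
alphas = uniform_alphas(morph, l2, n_steps=4)
steps = [l2(morph(a), morph(b)) for a, b in zip(alphas[:-1], alphas[1:])]
print("alphas:", np.round(alphas, 4))
print("step distances:", np.round(steps, 4))
```

In this toy setting the resulting factors are unevenly spaced in alpha while the step distances are nearly equal, which is the behavior a uniformly spaced alpha grid would not give for a non-linear morph.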
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The proposed method is very reasonable and has already shown good results in image processing.
- Mostly, the quantitative comparisons are solid and show the effectiveness of the method in various audio domains.
- The paper proposes metrics to measure the morphing quality in terms of correspondence, intermediateness, and smoothness, which is important for further development of this research topic.
- The contribution of the paper is relatively incremental, as the paper borrows most of its ideas from (Yang et al. 2023). All essential features of the proposed method, such as latent interpolation inside a Latent Diffusion Model, LoRA adaptation, finding the optimal trajectory with binary search on a sequence of values of an auxiliary metric, and even the introduction of 3 metrics for model evaluation, were proposed in (Yang et al. 2023). The paper is basically an attempt to adapt the approach
The paper is well-written and easy to follow, with examples provided in the supplementary material. There is an extensive experimental part with a user study, but it mainly compares with SMT.
- The paper lacks novelty. Most of the elements presented in the paper come from Yang et al. 2023, "IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models". The present paper can thus be seen as a straightforward adaptation of IMPUS to the audio domain.
- While LPIPS was chosen in IMPUS as the perceptual metric, here the choice of L2 over mel-spectrograms may be less appropriate.
- Concatenating audio segments in Eq. 8 in x-space seems to produce abrupt transitions. Please n
This is a fairly well-written paper. I found it interesting and a pleasure to read. Even though the method it presents is not a groundbreaking novelty and is only an inference-time "trick" to achieve good sound morphing, I found it a clever way of handling and extracting desired results from a pre-trained generative model. The related work section is quite comprehensive (at least, I couldn’t recall any paper that hasn't been mentioned here), the objectives of the work are clear, and the method i
Here is the list of my concerns, following the order of the sections. Sound morphing preliminary: In the paragraph at lines 141-149, the authors define three criteria: Correspondence, Intermediateness, and Smoothness. I think there are many critiques one can bring up about these criteria. 1. Correspondence: This criterion requires that the morph captures semantic-level transitions. However, perception of “semantic” qualities in sound can be subjective and context-dependent. If listeners interpret t
Taxonomy
Topics: Music Technology and Sound Studies · Neuroscience and Music Perception · Hearing Loss and Rehabilitation
Methods: Sparse Evolutionary Training · Diffusion
