SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model

Xinlei Niu; Jing Zhang; Charles Patrick Martin

arXiv:2410.02144 · cs.SD · December 17, 2024

Open Access 3 Reviews

TL;DR

SoundMorpher introduces a perceptually-uniform sound morphing method using diffusion models and perceptual metrics, enabling high-quality sound transitions for creative and industrial applications.

Contribution

It proposes a novel sound morphing approach that explicitly models perceptual differences and introduces new evaluation metrics for sound morphing quality.

Findings

01. Effective in generating perceptually uniform sound transitions

02. Outperforms traditional linear interpolation methods

03. Versatile for real-world audio applications

Abstract

We present SoundMorpher, an open-world sound morphing method designed to generate perceptually uniform morphing trajectories. Traditional sound morphing techniques typically assume a linear relationship between the morphing factor and sound perception, achieving smooth transitions by linearly interpolating the semantic features of source and target sounds while gradually adjusting the morphing factor. However, these methods oversimplify the complexities of sound perception, resulting in limitations in morphing quality. In contrast, SoundMorpher explores an explicit relationship between the morphing factor and the perception of morphed sounds, leveraging log Mel-spectrogram features. This approach further refines the morphing sequence by ensuring a constant target perceptual difference for each transition and determining the corresponding morphing factors using binary search. To address…
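The binary-search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_morph` is a hypothetical stand-in for the diffusion-based morphing model, plain NumPy arrays stand in for log Mel-spectrogram features, and the L2 norm is assumed as the perceptual distance. The sketch also assumes that the distance from the previous morph grows monotonically with the morphing factor, and that the per-step target can be taken as the endpoint distance divided by the number of steps.

```python
import numpy as np


def perceptual_distance(a, b):
    """L2 distance between two feature arrays (stand-ins for log Mel-spectrograms)."""
    return float(np.linalg.norm(a - b))


def toy_morph(alpha, src, tgt):
    """Hypothetical morph whose output changes nonlinearly with alpha."""
    w = alpha ** 3
    return (1.0 - w) * src + w * tgt


def uniform_alphas(morph, src, tgt, n_steps, tol=1e-9, max_iter=100):
    """Pick morphing factors so consecutive morphs are a constant distance apart."""
    start = morph(0.0, src, tgt)
    end = morph(1.0, src, tgt)
    # Assumed target: equal share of the endpoint distance for every transition.
    delta = perceptual_distance(start, end) / n_steps
    alphas = [0.0]
    for _step in range(n_steps - 1):
        prev = morph(alphas[-1], src, tgt)
        lo, hi = alphas[-1], 1.0
        for _it in range(max_iter):
            mid = 0.5 * (lo + hi)
            d = perceptual_distance(prev, morph(mid, src, tgt))
            if abs(d - delta) < tol:
                break
            if d < delta:
                lo = mid  # step too small: move the factor up
            else:
                hi = mid  # step too large: move the factor down
        alphas.append(0.5 * (lo + hi))
    alphas.append(1.0)  # the last morph is the target itself
    return alphas


if __name__ == "__main__":
    src = np.zeros(8)
    tgt = np.ones(8)
    print(uniform_alphas(toy_morph, src, tgt, n_steps=4))
```

With this toy morph, equally spaced factors would bunch most of the audible change near the end of the sequence; the searched factors are unevenly spaced so that each transition covers the same distance in feature space.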

Peer Reviews

Decision: ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The proposed method is very reasonable and has already shown good results in image processing.
- Mostly, the quantitative comparisons are solid and show the effectiveness of the method in various audio domains.
- The paper proposes metrics to measure the morphing quality in terms of correspondence, intermediateness and smoothness, which is important for further development of this research topic.

Weaknesses

- The contribution of the paper is relatively incremental, as the paper borrows most of its ideas from (Yang et al. 2023). All essential features of the proposed method, such as latent interpolation inside a Latent Diffusion Model, LoRA adaptation, finding the optimal trajectory with binary search on a sequence of values of an auxiliary metric, and even the introduction of 3 metrics for model evaluation, were proposed in (Yang et al. 2023). The paper is basically an attempt to adapt the approach

Reviewer 02 · Rating 3 · Confidence 5

Strengths

The paper is well-written and easy to follow, with examples provided in the supplementary material. The experimental section is extensive and includes a user study, but it mainly compares with SMT.

Weaknesses

- The paper lacks novelty. Most of the elements presented in the paper come from Yang et al. (2023), "IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models". The present paper can thus be seen as a straightforward adaptation of IMPUS to the audio domain.
- While LPIPS was chosen in IMPUS as the perceptual metric, here the choice of L2 over mel-spectrograms may be less appropriate.
- Concatenating audio segments in Eq. 8 in x-space seems to produce abrupt transitions. Please n

Reviewer 03 · Rating 5 · Confidence 4

Strengths

This is a fairly well-written paper. I found it interesting and a pleasure to read. Even though the method it presents is not a groundbreaking novelty and is only an inference-time "trick" to achieve good sound morphing, I found it a clever way of handling and extracting desired results from a pre-trained generative model. The related work section is quite comprehensive (at least, I couldn’t recall any paper that hasn't been mentioned here), the objectives of the work are clear, and the method i

Weaknesses

Here is the list of my concerns following the order of the sections. Sound morphing preliminary: In the paragraph at lines 141-149, the authors define 3 criteria: Correspondence, Intermediateness, and Smoothness. I think there are many critiques one can bring up against these criteria. 1. Correspondence: This criterion requires that the morph captures semantic-level transitions. However, perception of “semantic” qualities in sound can be subjective and context-dependent. If listeners interpret t

Taxonomy

Topics: Music Technology and Sound Studies · Neuroscience and Music Perception · Hearing Loss and Rehabilitation

Methods: Sparse Evolutionary Training · Diffusion