Generative Audio Extension and Morphing
Prem Seetharaman, Oriol Nieto, Justin Salamon

TL;DR
This paper introduces a generative audio method that extends and morphs sounds by masking the noisy latents of a DiT and applying a variant of classifier-free guidance to the masked latents. The method is evaluated with objective metrics and listener tests, and is aimed at creative sound design.
Contribution
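It presents a technique for extending a single audio reference forward or backward, and for morphing between two audio references, over a specified duration, using masked latents and a variant of classifier-free guidance. Fine-tuning the model on stationary audio data mitigates hallucinations.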
Findings
Generated audio achieves Fréchet Audio Distances (FADs) comparable to those of real samples from the training data.
Subjective listener tests favor the generated audio.
Fine-tuning on different types of stationary audio data mitigates hallucinations in generated sounds.
Abstract
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally,…
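The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify their guidance variant, so the code below shows only the standard classifier-free guidance combination, plus an inpainting-style mask over latent frames. All function names (`extension_mask`, `morph_mask`, `apply_mask`, `cfg`) are hypothetical, and latents are simplified to one float per frame.

```python
from typing import List


def extension_mask(total: int, ref_len: int, forward: bool = True) -> List[int]:
    """Mask for extending one reference: 1 = keep reference frame, 0 = generate.

    forward=True places the reference at the start (extend forward in time);
    forward=False places it at the end (extend backward in time).
    """
    if not 0 <= ref_len <= total:
        raise ValueError("ref_len must be between 0 and total")
    keep = [1] * ref_len
    generate = [0] * (total - ref_len)
    if forward:
        return keep + generate
    return generate + keep


def morph_mask(total: int, len_a: int, len_b: int) -> List[int]:
    """Mask for morphing: reference A at the start, reference B at the end,
    and the frames in between left for the model to generate."""
    if len_a < 0 or len_b < 0 or len_a + len_b > total:
        raise ValueError("reference lengths must fit inside total")
    return [1] * len_a + [0] * (total - len_a - len_b) + [1] * len_b


def apply_mask(noisy: List[float], reference: List[float], mask: List[int]) -> List[float]:
    """Overwrite masked-in frames of the noisy latent with reference frames."""
    return [r if m else x for x, r, m in zip(noisy, reference, mask)]


def cfg(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance (assumption: the paper uses a
    variant of this that is not described in the abstract)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

In this simplified picture, forward and backward extension differ only in where the kept frames sit, and morphing keeps frames at both ends so that the generated middle has to connect the two references. With `scale` set to 1.0, `cfg` returns the conditional prediction unchanged; larger values push the output further from the unconditional prediction.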
Taxonomy
Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
