Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
Hila Manor, Tomer Michaeli

TL;DR
This paper introduces two zero-shot audio editing methods using DDPM inversion with pre-trained diffusion models, enabling semantic and text-based modifications of music signals without supervision.
Contribution
It presents the first zero-shot audio editing techniques leveraging DDPM inversion, including a novel unsupervised method for discovering meaningful editing directions.
Findings
Enables semantic audio editing with pre-trained diffusion models, without additional training or supervision
Demonstrates control over instruments and melody in music signals
Provides open-source samples and code
Abstract
Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models. The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain. The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code can be found in https://hilamanor.github.io/AudioEditing/ .
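The abstract does not spell out how DDPM inversion works, so the following is only a toy sketch of the general idea: given a signal, extract a set of noise vectors such that running the stochastic sampler with those vectors reproduces the signal exactly. Everything here is an assumption made for illustration and is not taken from the paper or its code: the linear noise schedule, the `toy_denoiser` stand-in for a pre-trained noise predictor, and the function names are all hypothetical.

```python
import numpy as np

def make_schedule(num_steps, abar_first=0.999, abar_last=0.01):
    # Hypothetical cumulative-alpha schedule; index 0 is the clean end,
    # index num_steps is the noisy end.
    return np.linspace(abar_first, abar_last, num_steps + 1)

def toy_denoiser(x_t, t):
    # Stand-in for a pre-trained noise predictor eps_theta(x_t, t).
    # A real system would call a pre-trained diffusion model here.
    return 0.5 * np.tanh(x_t)

def step_mean_and_std(x_t, t, abar, eta=1.0):
    # Mean and standard deviation of one reverse (denoising) step.
    eps = toy_denoiser(x_t, t)
    x0_pred = (x_t - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])
    var = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * (1.0 - abar[t] / abar[t - 1])
    sigma = eta * np.sqrt(var)
    direction = np.sqrt(np.maximum(1.0 - abar[t - 1] - sigma**2, 0.0)) * eps
    mean = np.sqrt(abar[t - 1]) * x0_pred + direction
    return mean, sigma

def invert(x0, abar, rng, eta=1.0):
    # Build a noisy trajectory with independent noise at each step,
    # then solve each reverse step for the noise vector that explains it.
    num_steps = len(abar) - 1
    xs = [x0]
    for t in range(1, num_steps + 1):
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * noise)
    zs = [None] * (num_steps + 1)
    for t in range(num_steps, 0, -1):
        mean, sigma = step_mean_and_std(xs[t], t, abar, eta)
        zs[t] = (xs[t - 1] - mean) / sigma
    return xs[num_steps], zs

def sample_with_noise(x_T, zs, abar, eta=1.0):
    # Run the sampler from x_T, reusing the extracted noise vectors.
    x = x_T
    for t in range(len(abar) - 1, 0, -1):
        mean, sigma = step_mean_and_std(x, t, abar, eta)
        x = mean + sigma * zs[t]
    return x

rng = np.random.default_rng(0)
abar = make_schedule(num_steps=50)
signal = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))  # toy 1-D "audio"
x_T, zs = invert(signal, abar, rng)
reconstruction = sample_with_noise(x_T, zs, abar)
print(np.max(np.abs(reconstruction - signal)))
```

With an unchanged denoiser, the sampler retraces the stored trajectory and returns the input signal up to floating-point error. Editing schemes built on this kind of inversion generally keep the extracted noise vectors fixed and alter the model's behavior during sampling (for example, its conditioning); that editing step is not shown here.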
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
Methods: Diffusion
