Combining audio control and style transfer using latent diffusion
Nils Demerlé, Philippe Esling, Guillaume Doras, David Genova

TL;DR
This paper introduces a diffusion autoencoder-based model that unifies explicit control and style transfer for audio, enabling high-quality, controllable music synthesis and style transfer with improved fidelity.
Contribution
It proposes a novel approach that separates local and global musical features for combined control and style transfer using diffusion autoencoders and adversarial training.
Findings
Outperforms baselines in timbre transfer quality
Achieves high fidelity in MIDI-to-audio conversion
Generates cover versions with genre transfer
Abstract
Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect in their development from audio quality to control capabilities. Although text-to-music generation is getting largely adopted by the general public, explicit control and example-based style transfer are more adequate modalities to capture the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre respectively. To do so, we leverage the capabilities of diffusion autoencoders to extract semantic features, in order to build two representation spaces. We enforce disentanglement between those spaces using an adversarial criterion and a two-stage training strategy. Our resulting model can generate audio matching a timbre target, while…
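To make the local/global split in the abstract concrete, here is a minimal sketch, not the authors' architecture. It assumes that "local" information can be modeled as one code per time frame and "global" information as a single time-averaged code; the diffusion decoder, the adversarial criterion, and the two-stage training are all omitted. Every name here (`encode_local`, `encode_global`, `combine`, the random projection weights) is hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_local(frames, w_local):
    # Hypothetical "structure" encoder: keeps one code vector per time frame.
    return np.tanh(frames @ w_local)

def encode_global(frames, w_global):
    # Hypothetical "timbre" encoder: averages over time, so the code
    # carries no information about the order of the frames.
    return np.tanh(frames @ w_global).mean(axis=0)

def combine(local_code, global_code):
    # Tile the global vector across frames and concatenate it with the local
    # codes, giving one conditioning vector per frame for a decoder.
    tiled = np.broadcast_to(global_code, (local_code.shape[0], global_code.shape[0]))
    return np.concatenate([local_code, tiled], axis=1)

# Toy dimensions and random projections (assumptions, not values from the paper).
n_frames, n_feats, d_local, d_global = 8, 16, 4, 3
w_local = rng.normal(size=(n_feats, d_local))
w_global = rng.normal(size=(n_feats, d_global))

content = rng.normal(size=(n_frames, n_feats))  # stand-in for the structure source
style = rng.normal(size=(n_frames, n_feats))    # stand-in for the timbre target

# Style transfer in this toy setting: local codes from one input,
# global code from another.
cond = combine(encode_local(content, w_local), encode_global(style, w_global))
print(cond.shape)
```

In this toy setting, swapping the `style` input changes only the last `d_global` columns of `cond`, which is the property that makes mixing structure from one signal with timbre from another possible. In the paper, disentanglement between the two spaces is enforced by training rather than by construction.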
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Diffusion
