Unsupervised Source Separation By Steering Pretrained Music Models
Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo

TL;DR
This paper introduces an unsupervised approach to audio source separation that leverages pretrained music generation and tagging models without retraining, by navigating the generative model's latent space guided by source labels.
Contribution
The paper shows that existing pretrained models can be repurposed for source separation through latent-space optimization, with no retraining or weight updates.
Findings
Outperforms many supervised and unsupervised methods on separation tasks
Works across diverse source types and datasets
Shows potential of large pretrained music models for audio tasks
Abstract
We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss between the tag distribution for the generated audio and a predefined distribution for an isolated source is used to guide gradient ascent in the (unchanging) latent space of the generative model. This system does not update the weights of the generative model or the tagger, and only relies on moving through the generative model's latent space to produce separated sources. We use OpenAI's Jukebox as the pretrained generative model, and we couple it with four kinds of pretrained music taggers (two…
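The loop described in the abstract (decode a latent, tag the decoded audio, compare the tags to a target distribution, move the latent) can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not the authors' implementation. The frozen `decoder` and `tagger` here are small random linear maps rather than Jukebox or a real music tagger, and the names `steer_latent`, `LATENT_DIM`, `AUDIO_DIM`, and `N_TAGS`, along with the step size and step count, are hypothetical. The sketch steps against the gradient of the cross-entropy loss, which is the same as gradient ascent on the negative loss. Whether this matches the sign convention behind the abstract's "gradient ascent" is an assumption.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs))


def steer_latent(z, decoder, tagger, target, steps=500, lr=0.02):
    """Move only the latent z; decoder and tagger weights are never modified."""
    z = list(z)
    losses = []
    for _ in range(steps):
        audio = matvec(decoder, z)              # toy "generated audio"
        probs = softmax(matvec(tagger, audio))  # toy tag distribution
        losses.append(cross_entropy(target, probs))
        # Gradient of softmax + cross-entropy w.r.t. the logits
        # (valid because the target distribution sums to 1).
        g_logits = [p - t for p, t in zip(probs, target)]
        # Backpropagate through the frozen linear tagger and decoder.
        g_audio = [sum(tagger[k][i] * g_logits[k] for k in range(len(tagger)))
                   for i in range(len(audio))]
        g_z = [sum(decoder[i][j] * g_audio[i] for i in range(len(decoder)))
               for j in range(len(z))]
        # Step against the loss gradient; the model weights stay fixed.
        z = [zj - lr * gj for zj, gj in zip(z, g_z)]
    return z, losses


rng = random.Random(0)
LATENT_DIM, AUDIO_DIM, N_TAGS = 8, 16, 4  # hypothetical toy sizes
decoder = [[rng.gauss(0, 0.5) for _ in range(LATENT_DIM)] for _ in range(AUDIO_DIM)]
tagger = [[rng.gauss(0, 0.5) for _ in range(AUDIO_DIM)] for _ in range(N_TAGS)]
z0 = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]
target = [0.0, 1.0, 0.0, 0.0]  # predefined distribution: "only source 1 present"

z_final, losses = steer_latent(z0, decoder, tagger, target)
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

In the real system the decoder and tagger are deep networks, so the hand-written gradient above would be replaced by automatic differentiation. The sketch is only meant to show that the latent is the sole quantity being optimized.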
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
Methods: Residual Connection · Dilated Convolution · Dense Connections · VQ-VAE · Position-Wise Feed-Forward Layer · Convolution · Layer Normalization · Jukebox
