Unsupervised Source Separation By Steering Pretrained Music Models

Ethan Manilow; Patrick O'Reilly; Prem Seetharaman; Bryan Pardo

arXiv:2110.13071·cs.SD·October 26, 2021

TL;DR

This paper introduces an unsupervised approach to audio source separation that leverages pretrained music generation and tagging models without retraining, by navigating the generative model's latent space guided by source labels.

Contribution

The paper shows that existing pretrained models can be repurposed for source separation by optimizing within the generative model's latent space, with no retraining or weight updates to either model.

Findings

01. Outperforms many supervised and unsupervised methods on separation tasks

02. Works across diverse source types and datasets

03. Shows potential of large pretrained music models for audio tasks

Abstract

We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss between the tag distribution for the generated audio and a predefined distribution for an isolated source is used to guide gradient ascent in the (unchanging) latent space of the generative model. This system does not update the weights of the generative model or the tagger, and only relies on moving through the generative model's latent space to produce separated sources. We use OpenAI's Jukebox as the pretrained generative model, and we couple it with four kinds of pretrained music taggers (two…
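
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the small fixed matrices `decoder` and `tagger` are hypothetical stand-ins for Jukebox and a pretrained music tagger, and the name `steer_latent` is an assumption made for illustration. Only the latent vector changes; both "models" stay frozen. The abstract frames the update as gradient ascent; the sketch minimizes the cross-entropy, which amounts to ascent on its negative.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs))

def steer_latent(z, decoder, tagger, target, lr=0.1, steps=100):
    """Update only the latent z; decoder and tagger are never modified."""
    z = list(z)
    history = []
    for _ in range(steps):
        audio = matvec(decoder, z)              # toy "generated audio"
        probs = softmax(matvec(tagger, audio))  # toy "tag distribution"
        history.append(cross_entropy(target, probs))
        # Gradient of softmax cross-entropy w.r.t. logits (target sums to 1).
        g_logits = [p - t for p, t in zip(probs, target)]
        # Backpropagate through the frozen linear tagger and decoder.
        g_audio = [sum(tagger[k][i] * g_logits[k] for k in range(len(tagger)))
                   for i in range(len(audio))]
        g_z = [sum(decoder[i][j] * g_audio[i] for i in range(len(decoder)))
               for j in range(len(z))]
        # Step the latent against the loss gradient.
        z = [zj - lr * gj for zj, gj in zip(z, g_z)]
    return z, history

# Hypothetical frozen "models": latent dim 2, audio dim 3, two tags.
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
tagger = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.0]]
target = [1.0, 0.0]  # predefined distribution for the desired isolated source

z_final, losses = steer_latent([0.0, 1.0], decoder, tagger, target)
final_probs = softmax(matvec(tagger, matvec(decoder, z_final)))
```

In a real system the linear maps would be replaced by the generative model's decoder and the tagger network, with gradients obtained by automatic differentiation rather than by hand.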

Code & Models

Repositories

ethman/tagbox (PyTorch)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies

Methods: Residual Connection · Dilated Convolution · Dense Connections · VQ-VAE · Position-Wise Feed-Forward Layer · Convolution · Layer Normalization · Jukebox