User-guided Generative Source Separation
Yutong Wen, Minje Kim, and Paris Smaragdis

TL;DR
GuideSep introduces a flexible, diffusion-based music source separation model that allows user-guided, instrument-agnostic extraction, surpassing traditional fixed-class methods in versatility and quality.
Contribution
This work presents GuideSep, a diffusion-based music source separation (MSS) model conditioned on user inputs (a hummed or played waveform mimicking the target melody, plus mel-spectrogram domain masks), enabling instrument-agnostic separation beyond the standard four-stem setup.
Findings
Achieves high-quality separation with user-guided inputs
Demonstrates versatility in extracting various instruments
Outperforms prior fixed-class separation methods
Abstract
Music source separation (MSS) aims to extract individual instrument sources from their mixture. While most existing methods focus on the widely adopted four-stem separation setup (vocals, bass, drums, and other instruments), this approach lacks the flexibility needed for real-world applications. To address this, we propose GuideSep, a diffusion-based MSS model capable of instrument-agnostic separation beyond the four-stem setup. GuideSep is conditioned on multiple inputs: a waveform mimicry condition, which can be easily provided by humming or playing the target melody, and mel-spectrogram domain masks, which offer additional guidance for separation. Unlike prior approaches that relied on fixed class labels or sound queries, our conditioning scheme, coupled with the generative approach, provides greater flexibility and applicability. Additionally, we design a mask-prediction baseline…
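To make the idea of a mel-spectrogram domain mask more concrete, here is a minimal sketch of how a user-selected time/frequency region could be turned into a binary mask shaped like a mel-spectrogram. The abstract does not say how GuideSep builds or encodes its masks, so everything here is an assumption made for illustration: the function name `rectangle_mel_mask`, its parameters, the rectangular region, and the HTK-style mel formula are not taken from the paper.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """HTK-style Hz-to-mel conversion (an assumed choice of mel scale)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def rectangle_mel_mask(n_mels, n_frames, t_range, f_range,
                       sample_rate=44100, hop_length=512):
    """Hypothetical helper: binary mask of shape (n_mels, n_frames).

    The mask is 1 inside the rectangle spanned by t_range (seconds) and
    f_range (Hz), and 0 elsewhere. Not the paper's implementation.
    """
    t_start, t_end = t_range
    f_low, f_high = f_range

    # Mel bins evenly spaced on the mel scale between 0 Hz and Nyquist.
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 1)
    mel_centers = 0.5 * (mel_edges[:-1] + mel_edges[1:])
    in_band = (mel_centers >= hz_to_mel(f_low)) & (mel_centers <= hz_to_mel(f_high))

    # Start time of each spectrogram frame, in seconds.
    frame_times = np.arange(n_frames) * hop_length / sample_rate
    in_time = (frame_times >= t_start) & (frame_times <= t_end)

    # Outer product marks cells that fall inside both the band and the time span.
    return np.outer(in_band, in_time).astype(np.float32)


# Example: mark 200-2000 Hz between 0.5 s and 1.0 s of a 1-second clip.
mask = rectangle_mel_mask(64, 100, (0.5, 1.0), (200.0, 2000.0),
                          sample_rate=16000, hop_length=160)
```

A mask like this could be passed to a separation model as an extra conditioning channel alongside the mixture's mel-spectrogram; whether GuideSep does so in this form is not stated in the text above.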
Taxonomy
Topics: Speech and Audio Processing · Music Technology and Sound Studies · Music and Audio Processing
