A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
Karn N. Watcharasupat, Alexander Lerch

TL;DR
Banquet is a single-decoder, query-based system for music source separation. It supports stems beyond the traditional four-stem vocals, drums, bass, and other (VDBO) setup without requiring a dedicated decoder for each stem.
Contribution
The paper introduces a query-based, stem-agnostic source separation model: a bandsplit separation model is extended to accept a query and works in tandem with a PaSST music instrument recognition model, so that a single decoder can serve multiple stems.
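To illustrate what "query-based" and "single-decoder" can mean in practice, here is a minimal NumPy sketch of one shared decoder whose output mask is steered by a query embedding. It is not the authors' implementation. The scale-and-shift (FiLM-style) conditioning, the class name `QueryConditionedDecoder`, and all sizes are assumptions made for illustration. Random vectors stand in for the embeddings an instrument recognition model such as PaSST would supply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_BANDS, N_FRAMES, EMB_DIM, QUERY_DIM = 8, 16, 32, 24


class QueryConditionedDecoder:
    """One shared decoder; the query embedding selects which stem to extract."""

    def __init__(self):
        # Linear maps from the query to a per-channel scale and shift
        # (FiLM-style conditioning, an assumption of this sketch).
        self.w_scale = rng.normal(0.0, 0.1, (QUERY_DIM, EMB_DIM))
        self.w_shift = rng.normal(0.0, 0.1, (QUERY_DIM, EMB_DIM))
        # Mask-estimation weights shared by every stem.
        self.w_mask = rng.normal(0.0, 0.1, (EMB_DIM, 1))

    def n_params(self):
        # The count has no dependence on how many stems are supported.
        return self.w_scale.size + self.w_shift.size + self.w_mask.size

    def __call__(self, features, query):
        # features: (N_BANDS, N_FRAMES, EMB_DIM); query: (QUERY_DIM,)
        scale = 1.0 + query @ self.w_scale        # (EMB_DIM,)
        shift = query @ self.w_shift              # (EMB_DIM,)
        conditioned = features * scale + shift    # broadcast over bands and frames
        logits = (conditioned @ self.w_mask)[..., 0]
        return 1.0 / (1.0 + np.exp(-logits))      # sigmoid mask in (0, 1)


decoder = QueryConditionedDecoder()

# Stand-in for band-wise mixture features from a bandsplit encoder.
features = rng.normal(size=(N_BANDS, N_FRAMES, EMB_DIM))

# Stand-ins for query embeddings of two different target instruments.
query_guitar = rng.normal(size=QUERY_DIM)
query_piano = rng.normal(size=QUERY_DIM)

mask_guitar = decoder(features, query_guitar)
mask_piano = decoder(features, query_piano)
```

The same decoder weights produce different masks for different queries, so supporting another stem means supplying another query rather than adding another decoder.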
Findings
Approaches the performance of more complex models on VDBO stems
Outperforms them on guitar and piano separation
Supports extraction of rare (long-tail) instrument stems
Abstract
Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
