Music Source Separation with Band-Split RoPE Transformer
Wei-Tsung Lu, Ju-Chiang Wang, Qiuqiang Kong, Yun-Ning Hung

TL;DR
BS-RoFormer is a frequency-domain music source separation model that combines a band-split module with hierarchical Transformers using Rotary Position Embedding (RoPE), achieving state-of-the-art results.
Contribution
The paper proposes a new Band-Split RoPE Transformer architecture for MSS, combining band-split modules and hierarchical Transformers with Rotary Position Embedding.
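As a rough illustration of the band-split idea, the sketch below cuts one spectrogram frame into consecutive subbands and flattens each band's complex bins into real-valued features. The function name `split_bands` and the band widths are hypothetical, chosen only for this toy example; the learned per-band projection that turns each band into a subband-level representation is omitted.

```python
def split_bands(frame, band_widths):
    """Split one spectrogram frame into consecutive subbands.

    frame: list of complex STFT bins for a single time step.
    band_widths: hypothetical number of bins per band; must sum to len(frame).
    Returns one real-valued feature list per band: [re0, im0, re1, im1, ...].
    """
    if sum(band_widths) != len(frame):
        raise ValueError("band widths must cover every frequency bin exactly once")
    bands, start = [], 0
    for width in band_widths:
        feats = []
        for z in frame[start:start + width]:
            feats.extend([z.real, z.imag])  # keep real and imaginary parts side by side
        bands.append(feats)
        start += width
    return bands


# Toy 8-bin frame and made-up band widths (not the paper's configuration).
frame = [complex(i, -i) for i in range(8)]
bands = split_bands(frame, [2, 2, 4])
```

In a full model, each of these per-band feature lists would typically pass through its own learned layer so that bands of different widths end up with the same embedding size.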
Findings
Ranked first in the MSS track of the Sound Demixing Challenge 2023 (SDX23).
Achieved 9.80 dB average SDR on MUSDB18HQ without extra training data.
Outperformed previous methods on MUSDB18HQ, setting a new state of the art.
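For readers unfamiliar with the metric, SDR (signal-to-distortion ratio) compares the energy of the reference stem with the energy of the estimation error, in decibels. The sketch below is a simplified, whole-signal version; the helper name `sdr_db` and the `eps` stabilizer are assumptions for illustration, and benchmark toolkits usually add chunking and aggregation across songs that are not reproduced here.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Simplified signal-to-distortion ratio in dB for two equal-length signals.

    reference: ground-truth stem samples.
    estimate: separated stem samples produced by a model.
    eps: small constant (an assumption) that avoids division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


# An estimate that is uniformly 10% too quiet.
reference = [1.0] * 100
estimate = [0.9] * 100
score = sdr_db(reference, estimate)
```

Higher values mean less distortion; every 10 dB corresponds to a tenfold reduction in error energy relative to the reference.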
Abstract
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track…
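The Rotary Position Embedding mentioned in the abstract can be sketched in a few lines. RoPE rotates each consecutive pair of features in a query or key vector by an angle proportional to the token position, so that the dot product between a rotated query and a rotated key depends only on their relative offset. The function names, the interleaved pairing, and the `base=10000.0` default below are assumptions taken from the common RoPE formulation, not details confirmed for BS-RoFormer.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply a rotary position embedding to one even-length feature vector.

    Each pair (vec[i], vec[i + 1]) is rotated by position * base ** (-i / dim),
    so earlier pairs rotate quickly and later pairs rotate slowly.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation of the pair
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Same offset (4 positions apart) at two different absolute positions.
score_near = dot(rope(q, 3), rope(k, 7))
score_far = dot(rope(q, 10), rope(k, 14))
```

The two scores agree up to floating-point error, which is the relative-position property that makes RoPE a natural fit for attention over sequences along either time or frequency bands.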
Taxonomy
Topics: Speech and Audio Processing · Acoustic Wave Phenomena Research · Music and Audio Processing
