Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
Pengbo Lyu, Xiangyu Zhao, Chengwei Liu, Haoyin Yan, Xiaotao Liang, Hongyu Wang, and Shaofei Xue

TL;DR
This paper introduces a generative approach to multi-track music source separation in which a language model generates discrete audio tokens. It reaches perceptual quality close to state-of-the-art discriminative methods and the highest NISQA score on the vocals track of MUSDB18-HQ.
Contribution
It reformulates music source separation as conditional discrete token generation, using an architecture that combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model.
Findings
Achieves perceptual quality approaching that of state-of-the-art discriminative methods on MUSDB18-HQ.
Attains the highest NISQA score on the vocals track of MUSDB18-HQ.
Ablations confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
Abstract
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
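The sequential cross-track generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the track order, the codebook size `VOCAB_SIZE`, the `BOS` marker, greedy decoding, and the stand-in scoring function `toy_next_token_scores` are all assumptions made for illustration, and the Conformer encoder and HCodec decoder are reduced to a plain conditioning vector and omitted, respectively. The only point it shows is that each track's tokens are generated autoregressively with the tokens of earlier tracks in the prefix.

```python
import random
from typing import Callable, Dict, List, Sequence

# Assumptions for illustration only (not taken from the paper):
TRACKS = ["vocals", "drums", "bass", "other"]  # MUSDB18-HQ stems; order is hypothetical
VOCAB_SIZE = 1024                               # hypothetical codec codebook size
BOS = VOCAB_SIZE                                # hypothetical start-of-track marker

NextTokenFn = Callable[[Sequence[float], Sequence[int]], List[float]]


def toy_next_token_scores(condition: Sequence[float], prefix: Sequence[int]) -> List[float]:
    """Stand-in for a decoder-only language model.

    Returns one deterministic pseudo-random score per codec token, seeded by
    the conditioning vector and the tokens generated so far.
    """
    seed = int(sum(condition) * 1000) + 7919 * len(prefix) + sum(prefix)
    rng = random.Random(seed)
    return [rng.random() for _ in range(VOCAB_SIZE)]


def generate_tracks(
    condition: Sequence[float],
    tokens_per_track: int,
    next_token_scores: NextTokenFn = toy_next_token_scores,
) -> Dict[str, List[int]]:
    """Generate tokens track by track, sharing a single growing prefix."""
    prefix: List[int] = []
    output: Dict[str, List[int]] = {}
    for track in TRACKS:
        prefix.append(BOS)  # mark the start of a new track in the shared prefix
        track_tokens: List[int] = []
        for _ in range(tokens_per_track):
            scores = next_token_scores(condition, prefix)
            # Greedy choice; a real system might sample instead.
            token = max(range(VOCAB_SIZE), key=scores.__getitem__)
            track_tokens.append(token)
            prefix.append(token)  # later tracks see earlier tracks' tokens
        output[track] = track_tokens
    return output


if __name__ == "__main__":
    mixture_features = [0.1, 0.2, 0.3]  # placeholder for encoder output
    tokens = generate_tracks(mixture_features, tokens_per_track=4)
    for name, ids in tokens.items():
        print(name, ids)
```

In a full system the returned token lists would be passed to the codec decoder to reconstruct one waveform per track.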
