Semi-Supervised Singing Voice Separation with Noisy Self-Training
Zhepei Wang, Ritwik Giri, Umut Isik, Jean-Marc Valin, Arvindh Krishnaswamy

TL;DR
This paper introduces a semi-supervised approach for singing voice separation that uses noisy self-training to leverage unlabeled data, improving performance over traditional supervised methods.
Contribution
It adapts the noisy self-training framework to singing voice separation, using a large corpus of unlabeled mixtures to address the scarcity of labeled data with clean sources.
Findings
Self-training improves separation quality.
Data augmentation enhances model performance.
Outperforms supervised baselines.
Abstract
Recent progress in singing voice separation has primarily focused on supervised deep learning methods. However, the scarcity of ground-truth data with clean musical sources has long been a problem. Given a limited set of labeled data, we present a method to leverage a large volume of unlabeled data to improve the model's performance. Following the noisy self-training framework, we first train a teacher network on the small labeled dataset and infer pseudo-labels from the large corpus of unlabeled mixtures. Then, a larger student network is trained on the combined ground-truth and self-labeled datasets. Empirical results show that the proposed self-training scheme, along with data augmentation methods, effectively leverages the large unlabeled corpus and obtains superior performance compared to supervised methods.
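The teacher/student procedure described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the scalar "mixture" and "vocal" values, the least-squares `fit_linear` stand-in for network training, the Gaussian input noise used in place of the paper's data augmentation, and all function and parameter names are assumptions made for illustration. Only the ordering of the steps (train teacher on labeled data, infer pseudo-labels on unlabeled mixtures, train a higher-capacity student on the combined set with noise) follows the abstract.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]      # (mixture value, vocal target); toy scalar data
Model = Callable[[float], float]   # maps a mixture value to a vocal estimate


def fit_linear(data: Sequence[Example], with_bias: bool) -> Model:
    """Least-squares fit of vocal ~ a * mixture (+ b).

    Hypothetical stand-in for training a separation network.
    """
    n = len(data)
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    if with_bias:
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        a = sum((x - mx) * (y - my) for x, y in data) / var if var else 0.0
        b = my - a * mx
    else:
        den = sum(x * x for x in xs)
        a = sum(x * y for x, y in data) / den if den else 0.0
        b = 0.0
    return lambda x: a * x + b


def noisy_self_training(
    labeled: Sequence[Example],
    unlabeled: Sequence[float],
    noise_std: float = 0.05,
    seed: int = 0,
) -> Tuple[Model, Model, List[Example]]:
    rng = random.Random(seed)

    # 1) Teacher: trained only on the small labeled set.
    teacher = fit_linear(labeled, with_bias=False)

    # 2) Pseudo-labels: teacher predictions on the clean unlabeled mixtures.
    pseudo = [(x, teacher(x)) for x in unlabeled]

    # 3) Student: a model with more parameters (here, an extra bias term),
    #    trained on labeled + pseudo-labeled data. Gaussian input noise is an
    #    assumed placeholder for the data augmentation mentioned in the paper.
    combined = list(labeled) + pseudo
    noisy = [(x + rng.gauss(0.0, noise_std), y) for x, y in combined]
    student = fit_linear(noisy, with_bias=True)

    return teacher, student, pseudo


if __name__ == "__main__":
    labeled = [(1.0, 0.5), (2.0, 1.0)]   # toy labeled pairs
    unlabeled = [3.0, 4.0]               # toy unlabeled mixtures
    teacher, student, pseudo = noisy_self_training(labeled, unlabeled)
    print("pseudo-labels:", pseudo)
    print("student estimate for mixture 2.0:", student(2.0))
```

In a real system both models would be neural networks operating on spectrograms or waveforms, and the noise step would be replaced by whatever augmentations the training pipeline uses; the control flow would stay the same.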
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
