Continual Learning for Singing Voice Separation with Human in the Loop Adaptation
Ankur Gupta, Anshul Rai, Archit Bansal, Vipul Arora

TL;DR
This paper introduces an interactive continual learning framework for singing voice separation that enables user-guided model fine-tuning to adapt to diverse music tracks, improving performance in real-world scenarios.
Contribution
It presents a novel human-in-the-loop continual learning approach with two algorithms, enhancing singing voice separation adaptability and performance.
Findings
Improved separation performance over base models.
Effective adaptation in intra-dataset and inter-dataset scenarios.
User feedback significantly refines model accuracy.
Abstract
Deep learning-based works for singing voice separation have performed exceptionally well in the recent past. However, most of these works do not focus on allowing users to interact with the model to improve performance. This can be crucial when deploying the model in real-world scenarios where music tracks can vary from the original training data in both genre and instruments. In this paper, we present a deep learning-based interactive continual learning framework for singing voice separation that allows users to fine-tune the vocal separation model to conform it to new target songs. We use a U-Net-based base model architecture that produces a mask for separating vocals from the spectrogram, followed by a human-in-the-loop task where the user provides feedback by marking a few false positives, i.e., regions in the extracted vocals that should have been silence. We propose two continual…
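The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two continual learning algorithms are not described in the text available here. Random arrays stand in for the magnitude spectrogram and for the U-Net mask, and the names `apply_mask`, `false_positive_penalty`, and `marked_regions` are hypothetical. The sketch only shows how user-marked false-positive regions could be turned into a scalar penalty that a fine-tuning step might try to reduce.

```python
import numpy as np


def apply_mask(mixture_mag, mask):
    """Estimate the vocal magnitude spectrogram by element-wise masking."""
    return mixture_mag * mask


def false_positive_penalty(vocal_mag, marked_regions):
    """Mean squared magnitude of the estimated vocals inside user-marked frames.

    marked_regions is a list of (start_frame, end_frame) pairs (end exclusive)
    that the user flagged as "should have been silence". This penalty is an
    illustrative assumption, not the loss used in the paper.
    """
    flagged = np.zeros(vocal_mag.shape[1], dtype=bool)
    for start, end in marked_regions:
        flagged[start:end] = True
    if not flagged.any():
        return 0.0
    return float(np.mean(vocal_mag[:, flagged] ** 2))


# Stand-in data: 513 frequency bins x 100 time frames.
rng = np.random.default_rng(0)
mixture_mag = rng.random((513, 100))
mask = rng.random((513, 100))  # placeholder for the base model's output mask

vocals = apply_mask(mixture_mag, mask)
marked_regions = [(10, 20), (60, 75)]  # hypothetical user feedback
penalty_before = false_positive_penalty(vocals, marked_regions)

# An idealised "adapted" mask that is silent in the flagged frames.
adapted_mask = mask.copy()
for start, end in marked_regions:
    adapted_mask[:, start:end] = 0.0
penalty_after = false_positive_penalty(
    apply_mask(mixture_mag, adapted_mask), marked_regions
)
```

In a real system the mask would come from the trained network, and the penalty would be minimised by gradient-based fine-tuning rather than by zeroing the mask directly; the zeroing step here only shows the outcome that fine-tuning aims for.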
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
