Jointly Detecting and Separating Singing Voice: A Multi-Task Approach
Daniel Stoller, Sebastian Ewert, Simon Dixon

TL;DR
This paper presents a multi-task learning approach that jointly detects and separates singing voices, improving performance and robustness across datasets, while highlighting the need for better evaluation metrics.
Contribution
The paper introduces a multi-task model that combines vocal activity detection with vocal separation, is robust to dataset-specific biases, and improves on single-task baselines.
Findings
Improved separation and vocal detection performance compared to single-task baselines.
Robustness to dataset-specific biases that could limit the generalisation of separation and detection models.
SDR metrics may not fully capture improvements in non-vocal sections.
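One plausible reason for the last finding can be illustrated with a simplified, energy-ratio form of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²). This is a minimal sketch, not the evaluation code used in the paper; the function name `simple_sdr` and the toy signals are assumptions made for illustration. When the reference vocal track is silent, the numerator is zero and the ratio carries no information, however much accompaniment leaks into the estimate.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Hypothetical helper for illustration; standard toolkits compute SDR
    with additional steps that are omitted here.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    if signal_energy == 0.0:
        # Silent reference (non-vocal section): the ratio is undefined.
        return float("nan")
    if error_energy == 0.0:
        # Perfect estimate of a non-silent reference.
        return float("inf")
    return 10.0 * np.log10(signal_energy / error_energy)

# Non-vocal section: the true vocal signal is silent.
silent_reference = np.zeros(4)
loud_leak = np.full(4, 0.5)    # estimate with strong accompaniment leakage
quiet_leak = np.full(4, 0.01)  # estimate with almost no leakage

sdr_loud = simple_sdr(silent_reference, loud_leak)
sdr_quiet = simple_sdr(silent_reference, quiet_leak)

# Vocal section: the metric behaves as expected.
sdr_vocal = simple_sdr(np.ones(4), np.full(4, 0.9))
```

Both `sdr_loud` and `sdr_quiet` come out undefined, so under this simplified metric a model that suppresses leakage in non-vocal sections gets no credit for doing so.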
Abstract
A main challenge in applying deep learning to music processing is the availability of training data. One potential solution is Multi-task Learning, in which the model also learns to solve related auxiliary tasks on additional datasets to exploit their correlation. While intuitive in principle, it can be challenging to identify related tasks and construct the model to optimally share information between tasks. In this paper, we explore vocal activity detection as an additional task to stabilise and improve the performance of vocal separation. Further, we identify problematic biases specific to each dataset that could limit the generalisation capability of separation and detection models, to which our proposed approach is robust. Experiments show improved performance in separation as well as vocal detection compared to single-task baselines. However, we find that the commonly used…
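The idea of adding vocal activity detection as an auxiliary task can be sketched as a weighted sum of two losses. This is a generic multi-task objective written for illustration, not the loss from the paper: the mean squared error for separation, the binary cross-entropy for detection, the weight `alpha`, and the function name `joint_loss` are all assumptions. It also assumes that each example carries both a reference vocal signal and frame-wise activity labels, which need not hold when the tasks are trained on separate datasets.

```python
import numpy as np

def joint_loss(vocal_est, vocal_ref, activity_prob, activity_ref, alpha=0.5):
    """Hypothetical joint objective: separation loss + alpha * detection loss."""
    vocal_est = np.asarray(vocal_est, dtype=float)
    vocal_ref = np.asarray(vocal_ref, dtype=float)
    activity_prob = np.asarray(activity_prob, dtype=float)
    activity_ref = np.asarray(activity_ref, dtype=float)

    # Separation task: mean squared error on the estimated vocal signal.
    separation_loss = np.mean((vocal_est - vocal_ref) ** 2)

    # Detection task: binary cross-entropy on frame-wise vocal activity.
    eps = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1
    p = np.clip(activity_prob, eps, 1.0 - eps)
    detection_loss = -np.mean(
        activity_ref * np.log(p) + (1.0 - activity_ref) * np.log(1.0 - p)
    )

    return separation_loss + alpha * detection_loss

# Toy example: identical separation output, different detection quality.
vocal_ref = np.array([0.0, 0.2, 0.4, 0.0])
vocal_est = np.array([0.0, 0.1, 0.5, 0.0])
activity_ref = np.array([0.0, 1.0, 1.0, 0.0])

good_detection = joint_loss(vocal_est, vocal_ref, [0.1, 0.9, 0.9, 0.1], activity_ref)
poor_detection = joint_loss(vocal_est, vocal_ref, [0.6, 0.4, 0.4, 0.6], activity_ref)
separation_only = joint_loss(vocal_est, vocal_ref, [0.6, 0.4, 0.4, 0.6], activity_ref, alpha=0.0)
```

With `alpha=0.0` the objective reduces to the single-task separation loss, so the weight controls how strongly the auxiliary detection task shapes training.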
