Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation
Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, and Tuomas Virtanen

TL;DR
This paper replaces the RNNs in a singing voice separation architecture with depthwise separable (DWS) convolutions and reports better separation performance with far fewer parameters, making the convolutional variant a lighter and more parallelizable alternative.
Contribution
It introduces a novel application of DWS convolutions in music source separation, replacing RNNs and showing performance gains with fewer parameters.
Findings
DWS-CNNs outperform RNNs in separation metrics.
Replacing RNNs with DWS-CNNs reduces the parameter count by roughly 80%.
DWS-CNNs achieve higher signal-to-artifacts, signal-to-interference, and signal-to-distortion ratios.
Abstract
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of…
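To illustrate why the abstract calls DWS convolutions "lightweight", the sketch below counts the parameters of a standard 2-D convolution against a depthwise separable one, which factors the operation into a per-channel (depthwise) spatial filter followed by a 1x1 (pointwise) channel mixer. The channel count and kernel size are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Parameter count of a standard 2-D convolution layer."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)


def dws_conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Parameter count of a depthwise separable 2-D convolution layer."""
    # Depthwise step: one k_h x k_w filter per input channel.
    depthwise = c_in * k_h * k_w + (c_in if bias else 0)
    # Pointwise step: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Assumed (hypothetical) layer shape: 64 -> 64 channels, 5x5 kernel.
standard = conv2d_params(64, 64, 5, 5)
separable = dws_conv2d_params(64, 64, 5, 5)
ratio = separable / standard

print(f"standard:  {standard} parameters")
print(f"separable: {separable} parameters")
print(f"separable / standard = {ratio:.3f}")
```

This is a per-layer comparison only. The reduction reported for the full architecture also depends on how many layers and channels are used and on which RNN layers are being replaced.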
