TL;DR
This paper introduces MaD TwinNet, a deep learning architecture that improves monaural singing voice separation by modeling long-term musical structures using twin networks, achieving state-of-the-art results.
Contribution
It combines the Masker-Denoiser (MaD) architecture with Twin Networks, a regularizer for recurrent networks, to better capture long-term dependencies in singing voice separation.
Findings
Improved signal-to-distortion ratio (SDR) by 0.37 dB over the previous SOTA.
Improved signal-to-interference ratio (SIR) by 0.23 dB over the previous SOTA.
Evaluated on the Demixing Secret Dataset.
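To give a feel for the size of these gains, the sketch below converts a decibel difference into a multiplicative energy ratio. It also computes a simplified signal-to-distortion ratio on a toy signal. This is an illustration only: the simplified ratio is an assumption made here for clarity and is not the full evaluation procedure used in the paper, and the function names and sample values are hypothetical.

```python
import math

def db_to_power_ratio(delta_db):
    """Convert a decibel difference into a multiplicative energy ratio."""
    return 10.0 ** (delta_db / 10.0)

def energy_ratio_db(reference, estimate):
    """Simplified distortion ratio in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this ignores the signal decomposition that standard
    separation metrics apply before taking the ratio.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

# Reported gains expressed as energy ratios.
sdr_gain = db_to_power_ratio(0.37)  # roughly a 9% better ratio
sir_gain = db_to_power_ratio(0.23)  # roughly a 5% better ratio

# Toy example with made-up samples: the estimate is off by 10% everywhere,
# which gives a ratio of about 20 dB.
toy_reference = [1.0, -1.0, 1.0, -1.0]
toy_estimate = [0.9, -0.9, 0.9, -0.9]
toy_sdr = energy_ratio_db(toy_reference, toy_estimate)
```

Because decibels are logarithmic, a gain of a few tenths of a dB corresponds to a single-digit percentage change in the underlying energy ratio.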
Abstract
The monaural singing voice separation task focuses on predicting the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep learning-based methods. In this work, we present a novel deep learning-based method that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and enhance it with Twin Networks, a technique to regularize a recurrent generative network using a backward-running copy of the network. We evaluate our method on the Demixing Secret Dataset and obtain an increase in signal-to-distortion ratio (SDR) of 0.37 dB and in signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.
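The Twin Networks idea mentioned in the abstract can be sketched in a few lines. The toy example below is not the authors' implementation: it assumes a scalar tanh recurrence, hand-picked weights, and a mean squared penalty between an affine map of the forward states and the backward states. All names and values are hypothetical. It only shows how a backward-running copy yields a per-frame target that encourages the forward network to encode information about the future.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Run a toy scalar tanh RNN and return the hidden state at every step."""
    hidden, states = 0.0, []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def twin_penalty(inputs, fwd_weights, bwd_weights, scale=1.0, shift=0.0):
    """Mean squared gap between affinely mapped forward states and backward states."""
    forward = run_rnn(inputs, *fwd_weights)
    # The twin reads the sequence in reverse; flip its states back so that
    # index t refers to the same time frame in both lists.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    gaps = [(scale * f + shift - b) ** 2 for f, b in zip(forward, backward)]
    return sum(gaps) / len(gaps)

frames = [0.2, -0.1, 0.4, 0.3]  # made-up input values
penalty = twin_penalty(frames, fwd_weights=(0.8, 0.5), bwd_weights=(0.7, 0.6))

# With no recurrence and identical input weights, both networks see only the
# current frame, so their states coincide and the penalty vanishes.
no_memory_penalty = twin_penalty(frames, (0.8, 0.0), (0.8, 0.0))
```

In a training setup of this kind, such a penalty would typically be added to the main separation loss, and the backward copy would be needed only during training.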
