Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, Yuxuan Wang

TL;DR
This paper introduces a deep residual UNet model that decouples magnitude and phase estimation for music source separation, allowing masks larger than 1 and achieving state-of-the-art results on MUSDB18.
Contribution
The paper estimates complex ideal ratio masks (cIRMs) by decoupling them into separate magnitude and phase estimates, extends the separation method so that mask magnitudes can exceed 1, and builds the system on a deep residual UNet architecture.
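As a rough illustration of the decoupling idea (a minimal NumPy sketch, not the authors' implementation), a complex mask can be represented by a non-negative magnitude plus a unit-norm phase term and then applied to the mixture STFT. The function name `apply_decoupled_mask`, its argument names, and the `eps` constant are hypothetical; in the paper these quantities would be predicted by a network, whereas here they are simply passed in.

```python
import numpy as np

def apply_decoupled_mask(mixture_stft, mask_mag, mask_real, mask_imag, eps=1e-8):
    """Apply a complex mask given as a magnitude and an unnormalized phase vector.

    mixture_stft : complex array, STFT of the mixture
    mask_mag     : non-negative real array, mask magnitude (not clipped to [0, 1])
    mask_real, mask_imag : real arrays; only their direction (angle) is used
    """
    # Normalize (mask_real, mask_imag) onto the unit circle so it carries phase only.
    norm = np.sqrt(mask_real ** 2 + mask_imag ** 2) + eps
    unit_phase = (mask_real + 1j * mask_imag) / norm
    # Scaling by mask_mag and rotating by unit_phase gives
    # |mask| * |mixture| * exp(j * (angle(mask) + angle(mixture))).
    return mask_mag * unit_phase * mixture_stft

# Usage: with the ideal complex mask, the source is recovered up to rounding error.
rng = np.random.default_rng(0)
mixture = rng.normal(size=(4, 6)) + 1j * rng.normal(size=(4, 6))
source = rng.normal(size=(4, 6)) + 1j * rng.normal(size=(4, 6))
ideal_mask = source / mixture
estimate = apply_decoupled_mask(mixture, np.abs(ideal_mask), ideal_mask.real, ideal_mask.imag)
print("max reconstruction error:", np.max(np.abs(estimate - source)))
```

Keeping the magnitude and the phase direction as separate inputs means the magnitude can be given any non-negative range without affecting how the phase is represented.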
Findings
Achieves an SDR of 8.98 dB on MUSDB18 vocals, surpassing the previous best of 7.24 dB.
Effectively estimates phase by decoupling from magnitude in complex IRMs.
Utilizes a deep residual UNet with up to 143 layers for improved separation.
Abstract
Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-frequency bins have ideal ratio mask values of over 1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs) where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a…
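The abstract's observation that ideal ratio mask values can exceed 1 can be made concrete with a small sketch. It assumes the magnitude IRM is defined as |source| / |mixture| per time-frequency bin; the helper name `fraction_irm_above_one` and the `eps` constant are hypothetical, and the toy values below are illustrative only, not data from MUSDB18.

```python
import numpy as np

def fraction_irm_above_one(source_stft, mixture_stft, eps=1e-8):
    """Fraction of time-frequency bins where |source| / |mixture| exceeds 1."""
    irm = np.abs(source_stft) / (np.abs(mixture_stft) + eps)
    return float(np.mean(irm > 1.0))

# Toy example with two bins and two sources.
# Bin 0: the sources are out of phase and partially cancel, so the mixture is
#        smaller in magnitude than the source and the IRM is above 1.
# Bin 1: the sources are in phase, so the IRM is below 1.
source = np.array([1.0 + 0j, 1.0 + 0j])
other = np.array([-0.5 + 0j, 1.0 + 0j])
mixture = source + other
print(fraction_irm_above_one(source, mixture))
```

A mask bounded to the range 0 to 1 cannot reproduce the source magnitude in bins like the first one, which is the limitation the abstract describes.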
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Blind Source Separation Techniques
