Complex ratio masking for singing voice separation
Yixuan Zhang, Yuzhou Liu, DeLiang Wang

TL;DR
This paper introduces a complex ratio masking approach for singing voice separation. It uses DenseUNet with self-attention to estimate the real and imaginary parts of each source's STFT, so phase is recovered along with magnitude, and it outperforms recent state-of-the-art models.
Contribution
It presents a complex ratio masking method that estimates both the real and imaginary parts of the STFT for each source, combining self-attention with a simple ensemble technique to improve separation performance.
Findings
Outperforms recent state-of-the-art models on both separated voice and accompaniment
Utilizes phase information for better separation quality
Employs DenseUNet with self-attention for accurate complex STFT estimation
Abstract
Music source separation is important for applications such as karaoke and remixing. Much previous research focuses on estimating the short-time Fourier transform (STFT) magnitude and discards phase information. We observe that, for singing voice separation, phase can considerably improve separation quality. This paper proposes a complex ratio masking method for voice and accompaniment separation. The proposed method employs DenseUNet with self-attention to estimate the real and imaginary components of the STFT for each sound source. A simple ensemble technique is introduced to further improve separation performance. Evaluation results demonstrate that the proposed method outperforms recent state-of-the-art models for both separated voice and accompaniment.
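To make the idea of complex ratio masking concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the mask is applied by element-wise complex multiplication with the mixture STFT, and it computes an oracle mask from a known source rather than estimating one with DenseUNet. The function names and the random arrays standing in for real spectrograms are hypothetical and chosen only for illustration.

```python
import numpy as np

def ideal_complex_ratio_mask(source, mixture, eps=1e-8):
    """Oracle mask M such that M * mixture ~= source in every T-F bin."""
    y_r, y_i = mixture.real, mixture.imag
    s_r, s_i = source.real, source.imag
    denom = y_r**2 + y_i**2 + eps  # |Y|^2, regularized to avoid division by zero
    m_r = (y_r * s_r + y_i * s_i) / denom
    m_i = (y_r * s_i - y_i * s_r) / denom
    return m_r + 1j * m_i

def apply_complex_mask(mask, mixture):
    """Complex multiplication written out in real and imaginary parts."""
    est_r = mask.real * mixture.real - mask.imag * mixture.imag
    est_i = mask.real * mixture.imag + mask.imag * mixture.real
    return est_r + 1j * est_i

# Assumption: random complex arrays stand in for STFTs (frequency bins x frames).
rng = np.random.default_rng(0)
shape = (513, 100)
voice = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
accomp = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = voice + accomp  # the STFT is linear, so sources add in the complex domain

# Complex masking recovers both magnitude and phase of the voice.
crm = ideal_complex_ratio_mask(voice, mixture)
voice_crm = apply_complex_mask(crm, mixture)

# Magnitude-only baseline: correct magnitude, but phase borrowed from the mixture.
mag_mask = np.abs(voice) / (np.abs(mixture) + 1e-8)
voice_mag = mag_mask * mixture

err_crm = np.mean(np.abs(voice - voice_crm) ** 2)
err_mag = np.mean(np.abs(voice - voice_mag) ** 2)
print(f"complex mask error:        {err_crm:.3e}")
print(f"magnitude-only mask error: {err_mag:.3e}")
```

Even with a perfect magnitude estimate, the magnitude-only baseline keeps the mixture phase and so leaves a residual error, whereas the oracle complex mask reconstructs the source almost exactly. In a learned system the mask would only be an estimate, so the gap would be smaller than in this oracle comparison.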
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
