Complex ratio masking for singing voice separation

Yixuan Zhang; Yuzhou Liu; DeLiang Wang

arXiv:2011.02008·eess.AS·November 5, 2020·1 cites

Open Access

TL;DR

This paper introduces a complex ratio masking approach using DenseUNet with self-attention for singing voice separation, leveraging phase information to significantly improve separation quality over existing methods.

Contribution

It presents a novel complex ratio masking method that estimates both the real and imaginary parts of the STFT for each source, incorporating self-attention and an ensemble technique to improve separation performance.

Findings

01 Outperforms recent state-of-the-art models in voice separation

02 Utilizes phase information for better separation quality

03 Employs DenseUNet with self-attention for accurate complex STFT estimation

Abstract

Music source separation is important for applications such as karaoke and remixing. Much previous research focuses on estimating the short-time Fourier transform (STFT) magnitude and discards phase information. We observe that, for singing voice separation, phase information can considerably improve separation quality. This paper proposes a complex ratio masking method for voice and accompaniment separation. The proposed method employs DenseUNet with self-attention to estimate the real and imaginary components of the STFT for each sound source. A simple ensemble technique is introduced to further improve separation performance. Evaluation results demonstrate that the proposed method outperforms recent state-of-the-art models for both separated voice and accompaniment.
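
To make the idea of complex ratio masking concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the standard definition of an ideal complex ratio mask (the element-wise complex quotient of the source STFT by the mixture STFT) and uses random complex arrays as stand-ins for real STFTs. In the paper a DenseUNet estimates the real and imaginary components; here the mask is computed from known sources purely for illustration. All function names are hypothetical.

```python
import numpy as np


def ideal_complex_ratio_mask(source, mixture, eps=1e-8):
    """Return a complex mask M such that M * mixture approximates source.

    This is element-wise complex division S / Y written out in terms of
    real and imaginary parts, with eps guarding against division by zero.
    """
    yr, yi = mixture.real, mixture.imag
    sr, si = source.real, source.imag
    denom = yr ** 2 + yi ** 2 + eps
    mask_real = (yr * sr + yi * si) / denom
    mask_imag = (yr * si - yi * sr) / denom
    return mask_real + 1j * mask_imag


def apply_complex_mask(mask, mixture):
    """Complex multiplication adjusts both magnitude and phase."""
    return mask * mixture


def apply_magnitude_mask(source, mixture, eps=1e-8):
    """Magnitude-only baseline: scale the mixture, keep the mixture phase."""
    ratio = np.abs(source) / (np.abs(mixture) + eps)
    return ratio * mixture


# Stand-in "STFTs" (frequency bins x frames); real audio would use an STFT.
rng = np.random.default_rng(0)
shape = (257, 100)
voice = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
accompaniment = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = voice + accompaniment

crm = ideal_complex_ratio_mask(voice, mixture)
voice_from_crm = apply_complex_mask(crm, mixture)
voice_from_mag = apply_magnitude_mask(voice, mixture)

# Mean squared error of each reconstruction against the true voice STFT.
err_crm = np.mean(np.abs(voice_from_crm - voice) ** 2)
err_mag = np.mean(np.abs(voice_from_mag - voice) ** 2)
print(f"complex mask error:   {err_crm:.3e}")
print(f"magnitude mask error: {err_mag:.3e}")
```

Even with a perfect magnitude estimate, the magnitude-only baseline keeps the mixture phase and so cannot recover the source exactly, whereas the ideal complex mask can. This is the gap that estimating real and imaginary components aims to close.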

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis