Neural Vocoder Feature Estimation for Dry Singing Voice Separation
Jaekwon Im, Soonbeom Choi, Sangeon Yong, Juhan Nam

TL;DR
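This paper introduces a singing voice separation method that predicts mel-spectrograms of dry (reverberation-free) vocals from the mixture and passes them to a neural vocoder as input features to synthesize the vocal waveform. It targets dereverberation and reusability of the isolated voice, and reports better separation quality than existing models.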
Contribution
It proposes predicting mel-spectrograms of dry singing voices as neural vocoder features, either through masks in the mel-spectrogram domain or by direct prediction, as an alternative to conventional spectrogram masking. It also adds a singing voice detector.
Findings
Outperforms state-of-the-art models in objective metrics
Achieves better dereverberation and separation quality
Improves reusability of isolated singing voices
Abstract
Singing voice separation (SVS) is the task of separating singing voice audio from its mixture with instrumental audio. Previous SVS studies have mainly employed spectrogram masking, which requires a large output dimensionality to predict the binary masks. In addition, they focused on extracting a vocal stem that retains the wet sound with the reverberation effect, which may hinder the reusability of the isolated singing voice. This paper addresses these issues by predicting mel-spectrograms of dry singing voices from the mixed audio as neural vocoder features and synthesizing the singing voice waveforms with the neural vocoder. We experimented with two separation methods: one predicts binary masks in the mel-spectrogram domain, and the other predicts the mel-spectrogram directly. Furthermore, we add a singing voice detector to identify the singing voice segments over…
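The two separation methods named in the abstract differ only in what the network outputs, which a small NumPy sketch can make concrete. This is an illustration under stated assumptions, not the authors' implementation: the array sizes, the 0.5 threshold, the function names, and the `toy_predictor` stand-in for a trained network are all hypothetical, and the neural vocoder stage is omitted.

```python
import numpy as np

# Assumed sizes for illustration only; the paper's settings are not given here.
N_MELS, N_FRAMES = 80, 200


def binarize_mask(scores, threshold=0.5):
    """Turn per-bin scores in [0, 1] into a binary mask (threshold is an assumption)."""
    return (scores >= threshold).astype(np.float32)


def separate_by_mask(mixture_mel, mask):
    """Method 1 (sketch): multiply the mixture mel-spectrogram by a predicted mask."""
    if mixture_mel.shape != mask.shape:
        raise ValueError("mask and mixture must have the same shape")
    return mixture_mel * mask


def separate_directly(mixture_mel, predictor):
    """Method 2 (sketch): the predictor outputs the vocal mel-spectrogram itself."""
    predicted = predictor(mixture_mel)
    if predicted.shape != mixture_mel.shape:
        raise ValueError("predictor must keep the mel-spectrogram shape")
    return predicted


rng = np.random.default_rng(0)
# Random non-negative values stand in for a real mixture mel-spectrogram.
mixture_mel = rng.random((N_MELS, N_FRAMES), dtype=np.float32)
# Random scores stand in for a network's mask output.
mask = binarize_mask(rng.random((N_MELS, N_FRAMES), dtype=np.float32))

masked_mel = separate_by_mask(mixture_mel, mask)


def toy_predictor(mel):
    """Hypothetical placeholder for a trained network; it only rescales the input."""
    return 0.5 * mel


direct_mel = separate_directly(mixture_mel, toy_predictor)
```

In both cases the result has one value per mel band per frame and would be handed to a neural vocoder for waveform synthesis. A mask with values in [0, 1] can only keep or attenuate each bin of the mixture, whereas a direct prediction is not tied to the mixture's values bin by bin.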
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
