Real-Time Band-Grouped Vocal Denoising Using Sigmoid-Driven Ideal Ratio Masking
Daniel Williams

TL;DR
This paper introduces a real-time vocal denoising method that pairs a sigmoid-driven ideal ratio mask with a band-grouped architecture, achieving low latency and improved perceptual quality.
Contribution
It presents a model trained with a spectral loss, built on a band-grouped encoder-decoder architecture with frequency attention, for low-latency, high-quality vocal denoising in real-time applications.
Findings
Achieves a total latency of less than 10 ms.
Improves PESQ-WB by 0.21 on stationary noise and 0.12 on nonstationary noise.
Enhances perceptual quality of denoised vocals.
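To make the latency figure concrete, here is a minimal sketch of how the algorithmic latency of a frame-based denoiser is usually estimated. This summary does not state the paper's sample rate, window length, hop size, or lookahead, so every number below is a hypothetical placeholder, and the function name `algorithmic_latency_ms` is illustrative only. Compute time and audio I/O buffering are ignored.

```python
def algorithmic_latency_ms(window_samples, hop_samples, sample_rate, lookahead_frames=0):
    """Estimate algorithmic latency of frame-based processing in milliseconds.

    Assumption: the model must buffer one full analysis window before it can
    emit output, plus one hop per frame of future context (lookahead).
    """
    total_samples = window_samples + lookahead_frames * hop_samples
    return 1000.0 * total_samples / sample_rate


# Hypothetical configuration (not taken from the paper):
# 48 kHz audio, 384-sample window, 192-sample hop, no lookahead.
latency = algorithmic_latency_ms(window_samples=384, hop_samples=192, sample_rate=48000)
print(f"{latency:.1f} ms")
```

Under these assumed values the window alone accounts for 8 ms, which shows why a sub-10 ms budget leaves little or no room for frames of future context.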
Abstract
Real-time, deep learning-based vocal denoising has seen significant progress over the past few years, demonstrating the capability of artificial intelligence in preserving the naturalness of the voice while increasing the signal-to-noise ratio (SNR). However, many deep learning approaches have high latency and require long frames of context, making them difficult to configure for live applications. To address these challenges, we propose a sigmoid-driven ideal ratio mask trained with a spectral loss to encourage an increased SNR and maximized perceptual quality of the voice. The proposed model uses a band-grouped encoder-decoder architecture with frequency attention and achieves a total latency of less than 10 ms, with PESQ-WB improvements of 0.21 on stationary noise and 0.12 on nonstationary noise.
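The masking idea in the abstract can be illustrated with a short NumPy sketch. It is not the paper's implementation: the ideal ratio mask below uses one common square-root power-ratio definition, the spectral loss is a plain L1 distance on magnitudes, and the network is replaced by raw logits. All function names are hypothetical, and the paper's exact mask and loss formulations are not given in this summary.

```python
import numpy as np


def sigmoid(x):
    # Squashes network outputs (logits) into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    # One common IRM definition (an assumption here):
    # sqrt(|S|^2 / (|S|^2 + |N|^2)) per time-frequency bin.
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))


def apply_mask(noisy_spec, logits):
    # A sigmoid-driven mask scales each bin of the noisy spectrum;
    # the noisy phase is kept unchanged.
    return sigmoid(logits) * noisy_spec


def spectral_l1_loss(est_mag, clean_mag):
    # Placeholder spectral loss: mean absolute error between magnitudes.
    return float(np.mean(np.abs(est_mag - clean_mag)))


# Toy example with two frequency bins.
clean = np.array([1.0, 2.0])
noise = np.array([1.0, 0.0])
target_mask = ideal_ratio_mask(clean, noise)        # bin with equal power is attenuated
denoised = apply_mask(np.array([2 + 2j, 4 + 0j]), np.zeros(2))  # zero logits give a mask of 0.5
```

Because the sigmoid bounds the mask between 0 and 1, the model can only attenuate bins, never amplify them, which matches the range of the ideal ratio mask it is trained to approximate.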
