STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency
Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe, Jonathan Le Roux

TL;DR
This paper presents a low-latency STFT-domain neural speech enhancement method that combines a dual-window-size STFT/iSTFT configuration, complex spectral mapping, and future-frame prediction to keep enhancement quality high while cutting algorithmic latency well below the usual 32 ms.
Contribution
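It adapts a conventional dual-window-size approach (a regular input window for the STFT, a shorter output window for overlap-add) to frame-online deep-learning-based speech enhancement, and adds a future-frame prediction technique to reduce latency further.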
Findings
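Maintains strong enhancement performance at an algorithmic latency as low as 2 ms.
Compared with Conv-TasNet on a noisy-reverberant enhancement task, achieves better performance for a comparable amount of computation.
Alternatively, achieves comparable performance to Conv-TasNet with less computation.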
Abstract
Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the same window size. To reduce this inherent latency, we adapt a conventional dual-window-size approach, where a regular input window size is used for STFT but a shorter output window is used for overlap-add, for STFT-domain deep learning based frame-online speech enhancement. Based on this STFT-iSTFT configuration, we employ complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from…
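The dual-window-size configuration described in the abstract can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dual_window_roundtrip`, the 16 kHz sampling rate, the window shapes, the frame sizes, and the normalization scheme are all hypothetical choices, and an identity pass-through stands in for the DNN. Each frame is analyzed with a long input window, but only the last `n_out` samples of the resynthesized frame are overlap-added.

```python
import numpy as np

def dual_window_roundtrip(x, n_in=512, n_out=64, hop=32):
    """STFT with a long input window, overlap-add with a short output window.

    All sizes are in samples and are illustrative assumptions
    (at 16 kHz: n_in = 32 ms, n_out = 4 ms, hop = 2 ms).
    """
    assert n_out % hop == 0 and n_out <= n_in

    # Assumed analysis window: strictly positive square-root Hann.
    a = np.sqrt(np.hanning(n_in + 2)[1:-1])
    # Assumed synthesis prototype: strictly positive Hann of the short length.
    h = np.hanning(n_out + 2)[1:-1]

    # Normalize the synthesis window so that the product of the analysis-window
    # tail and the synthesis window sums to one across overlapping frames.
    a_tail = a[-n_out:]
    denom = (a_tail * h).reshape(-1, hop).sum(axis=0)
    s = h / np.tile(denom, n_out // hop)

    y = np.zeros_like(x, dtype=float)
    for start in range(0, len(x) - n_in + 1, hop):
        spec = np.fft.rfft(x[start:start + n_in] * a)
        # A DNN would modify `spec` here; this sketch passes it through unchanged.
        frame = np.fft.irfft(spec, n=n_in)
        end = start + n_in
        # Only the most recent n_out samples of the frame are overlap-added.
        y[end - n_out:end] += frame[-n_out:] * s
    return y
```

In this sketch, an output sample is final once the last frame whose short tail covers it has been processed, so the wait is bounded by the output window length (`n_out`) rather than by the full input window length (`n_in`). With the pass-through in place of a DNN, the interior of the signal is reconstructed exactly up to floating-point error, which is a quick sanity check on the window normalization.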
Taxonomy
Topics: Speech and Audio Processing · Ultrasonics and Acoustic Wave Propagation · Advanced Adaptive Filtering Techniques
