Phase-Aware Deep Speech Enhancement: It's All About The Frame Length
Tal Peer, Timo Gerkmann

TL;DR
This paper investigates how phase and magnitude information contribute to low-latency speech enhancement using DNNs, showing that phase estimation becomes more effective with shorter frames, enabling high-quality, low-latency processing.
Contribution
It systematically studies the role of phase and magnitude in DNN-based speech enhancement across different frame lengths, demonstrating effective phase estimation with short frames.
Findings
DNNs can successfully estimate phase with short frames
Short frames enable low-latency speech enhancement with high quality
Phase-aware DNNs outperform magnitude-only approaches at low latency
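To make the magnitude/phase distinction in these findings concrete, the following is a minimal NumPy sketch, not taken from the paper, that splits the spectrum of one short windowed frame into magnitude and phase and then recombines them. The function name `split_frame`, the 16 kHz sample rate, the 64-sample (4 ms) frame, and the Hann window are all illustrative assumptions.

```python
import numpy as np

def split_frame(frame):
    """Return (magnitude, phase) of the one-sided spectrum of a windowed frame."""
    window = np.hanning(len(frame))          # assumed analysis window
    spectrum = np.fft.rfft(frame * window)   # one-sided complex spectrum
    return np.abs(spectrum), np.angle(spectrum)

fs = 16000          # assumed sample rate in Hz
frame_len = 64      # 4 ms at 16 kHz, an illustrative "short" frame
t = np.arange(frame_len) / fs
frame = np.sin(2 * np.pi * 1000 * t)         # 1 kHz test tone

magnitude, phase = split_frame(frame)

# Recombining magnitude and phase recovers the windowed frame; a
# magnitude-only method would have to reuse the noisy phase here instead.
reconstructed = np.fft.irfft(magnitude * np.exp(1j * phase), n=frame_len)
```

A magnitude-only enhancer modifies only `magnitude` and keeps the noisy `phase`, whereas a phase-aware one estimates both.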
Abstract
Algorithmic latency in speech processing is dominated by the frame length used for Fourier analysis, which in turn limits the achievable performance of magnitude-centric approaches. As previous studies suggest the importance of phase grows with decreasing frame length, this work presents a systematic study on the contribution of phase and magnitude in modern Deep Neural Network (DNN)-based speech enhancement at different frame lengths. Results indicate that DNNs can successfully estimate phase when using short frames, with similar or better overall performance compared to using longer frames. Thus, interestingly, modern phase-aware DNNs allow for low-latency speech enhancement at high quality.
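As a rough illustration of how frame length translates into latency, the sketch below assumes a standard overlap-add STFT pipeline in which the algorithmic latency equals one frame length, ignoring computation time and any look-ahead inside the network. The helper name `algorithmic_latency_ms`, the 16 kHz sample rate, and the listed frame lengths are assumptions for illustration, not values from the paper.

```python
def algorithmic_latency_ms(frame_len, sample_rate):
    """Latency in milliseconds, assuming it equals one analysis frame."""
    return 1000.0 * frame_len / sample_rate

sample_rate = 16000  # assumed sample rate in Hz

# Illustrative frame lengths in samples, from short to long.
for frame_len in (32, 64, 128, 256, 512):
    latency = algorithmic_latency_ms(frame_len, sample_rate)
    print(f"{frame_len:4d} samples -> {latency:.1f} ms")
```

Under these assumptions, a 512-sample frame at 16 kHz corresponds to 32 ms of latency, while a 64-sample frame corresponds to 4 ms.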
Taxonomy
Topics: Speech and Audio Processing · Ultrasonics and Acoustic Wave Propagation · Underwater Acoustics Research
