A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments
Md Jahangir Alam Khondkar, Ajan Ahmed, Stephanie Schuckers, and Masudul Haider Imtiaz

TL;DR
This study benchmarks three deep learning models for speech enhancement in noisy environments, evaluating their noise suppression, perceptual quality, and speaker feature retention across multiple datasets.
Contribution
It provides a comprehensive comparative analysis of Wave-U-Net, CMGAN, and U-Net, highlighting their strengths and trade-offs in real-world speech enhancement tasks.
Findings
U-Net achieves the highest noise suppression, with significant SNR improvements.
CMGAN attains the best perceptual quality with top PESQ scores.
Wave-U-Net balances noise suppression and speaker feature retention.
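The SNR findings above are reported as percentage improvements. The sketch below shows one way such a figure could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes a clean reference signal is available (real-world recordings often lack one, in which case an estimated SNR would be needed instead), and the helper names `snr_db` and `percent_improvement` are hypothetical.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of `estimate` against a clean `reference`.

    Assumption: the noise is everything in `estimate` that differs from
    `reference`. Both inputs are equal-length sequences of samples.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    if signal_power == 0:
        raise ValueError("reference signal has zero energy")
    noise_power = sum((e - r) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf  # estimate is identical to the reference
    return 10.0 * math.log10(signal_power / noise_power)


def percent_improvement(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (after - before) / abs(before)


# Toy signals: a 440 Hz tone sampled at 16 kHz, with error proportional
# to the tone itself so the resulting SNR values are easy to reason about.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = [1.1 * s for s in clean]      # error is 10% of the signal amplitude
enhanced = [1.01 * s for s in clean]  # error is 1% of the signal amplitude

snr_before = snr_db(clean, noisy)
snr_after = snr_db(clean, enhanced)
gain = percent_improvement(snr_before, snr_after)
print(f"SNR before: {snr_before:.2f} dB, after: {snr_after:.2f} dB, change: {gain:+.2f}%")
```

A percentage computed on dB values depends strongly on the baseline: the same absolute gain looks far larger when the starting SNR is close to zero, which is worth keeping in mind when comparing percentages across datasets.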
Abstract
Speech enhancement, particularly denoising, is vital for improving the intelligibility and quality of speech signals in real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical gap in comparative performance evaluation. This study benchmarks three state-of-the-art models (Wave-U-Net, CMGAN, and U-Net) on diverse datasets, including SpEAR, VPQAD, and Clarkson. These models were chosen for their relevance in the literature and the accessibility of their code. The evaluation reveals that U-Net achieves high noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies
