On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement
Ashutosh Pandey, DeLiang Wang

TL;DR
This paper investigates why deep learning-based speech enhancement models generalize poorly to new speech corpora. It identifies channel mismatch as the main cause and proposes techniques such as channel normalization and smaller frame shifts to improve cross-corpus performance.
Contribution
The study shows how channel mismatch degrades cross-corpus generalization and proposes a combination of techniques that make enhancement models more robust on untrained speech corpora.
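To make "channel mismatch" concrete, the sketch below models a recording channel as a short FIR filter applied to the same underlying signal, then compares log-magnitude spectrograms before and after a textbook per-frequency mean normalization. Everything here is an assumption for illustration: the filter taps, the frame sizes, the white-noise stand-in for speech, and the helper names are hypothetical and are not taken from the paper. The setting is idealized (noise-free, fixed linear channel); the paper reports that traditional channel normalization does not improve cross-corpus generalization in practice, so this shows what the mismatch is, not a remedy.

```python
import cmath
import math
import random


def log_spectrogram(signal, frame_len=64, frame_shift=32, eps=1e-8):
    """Log-magnitude STFT with a Hann window, using a naive DFT (stdlib only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spec = []
        for k in range(n_bins):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spec.append(math.log(abs(acc) + eps))
        frames.append(spec)
    return frames


def apply_channel(signal, impulse_response):
    """Assumed channel model: linear convolution with a short FIR filter."""
    return [sum(h * signal[n - i]
                for i, h in enumerate(impulse_response) if n - i >= 0)
            for n in range(len(signal))]


def mean_normalize(log_spec):
    """Subtract the per-frequency mean over time from a log spectrogram."""
    n_frames, n_bins = len(log_spec), len(log_spec[0])
    means = [sum(frame[k] for frame in log_spec) / n_frames
             for k in range(n_bins)]
    return [[frame[k] - means[k] for k in range(n_bins)] for frame in log_spec]


def mean_abs_diff(a, b):
    """Average absolute difference between two equally sized spectrograms."""
    total = sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    return total / (len(a) * len(a[0]))


random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for speech
recorded = apply_channel(clean, [1.0, 0.6])            # hypothetical channel

spec_clean = log_spectrogram(clean)
spec_recorded = log_spectrogram(recorded)

gap_raw = mean_abs_diff(spec_clean, spec_recorded)
gap_norm = mean_abs_diff(mean_normalize(spec_clean),
                         mean_normalize(spec_recorded))
print(f"spectral gap before normalization: {gap_raw:.3f}")
print(f"spectral gap after normalization:  {gap_norm:.3f}")
```

A fixed linear channel appears as a roughly constant per-frequency offset in the log-magnitude domain, which is why mean subtraction shrinks the gap in this toy case. Real corpora also differ in microphones, rooms, and processing chains, which this model does not capture.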
Findings
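Channel mismatch, i.e. different recording conditions between the training corpus and untrained corpora, is the main cause of poor cross-corpus generalization.
Traditional channel normalization techniques are not effective in improving cross-corpus generalization.
Using a smaller frame shift in the STFT improves cross-corpus generalization.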
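The frame-shift finding can be pictured with a minimal framing sketch: for a fixed window length, a smaller shift produces more frames and more overlap between neighboring frames. The sample rate, window length, and shift values below are assumptions chosen for illustration, and `frame_signal` is a hypothetical helper; the summary does not state the settings used in the paper.

```python
def frame_signal(signal, frame_len, frame_shift):
    """Split a signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]


sample_rate = 16000            # assumed sampling rate in Hz
signal = [0.0] * sample_rate   # one second of placeholder audio
frame_len = 512                # assumed window length (32 ms at 16 kHz)

frame_counts = {}
for frame_shift in (256, 128, 64):
    frames = frame_signal(signal, frame_len, frame_shift)
    frame_counts[frame_shift] = len(frames)
    overlap = 1 - frame_shift / frame_len
    frames_per_sample = frame_len // frame_shift  # coverage away from the edges
    print(f"shift={frame_shift:4d}  frames={len(frames):4d}  "
          f"overlap={overlap:.0%}  frames covering each sample={frames_per_sample}")
```

Halving the shift roughly doubles the number of frames the model must process, so the smaller shift trades extra computation for a more redundant time-frequency representation of the same signal.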
Abstract
In recent years, supervised approaches using deep neural networks (DNNs) have become the mainstream for speech enhancement. It has been established that DNNs generalize well to untrained noises and speakers if trained using a large number of noises and speakers. However, we find that DNNs fail to generalize to new speech corpora in low signal-to-noise ratio (SNR) conditions. In this work, we establish that the lack of generalization is mainly due to the channel mismatch, i.e. different recording conditions between the trained and untrained corpus. Additionally, we observe that traditional channel normalization techniques are not effective in improving cross-corpus generalization. Further, we evaluate publicly available datasets that are promising for generalization. We find one particular corpus to be significantly better than others. Finally, we find that using a smaller frame shift in…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
