Feature Learning and Ensemble Pre-Tasks Based Self-Supervised Speech Denoising and Dereverberation
Yi Li, ShuangLin Li, Yang Sun, Syed Mohsen Naqvi

TL;DR
This paper introduces a novel self-supervised speech enhancement approach that combines feature learning and ensemble pre-tasks, improving denoising and dereverberation, especially for unseen speakers, by leveraging multiple feature types and training strategies.
Contribution
It proposes an ensemble training strategy with multiple pre-tasks, including latent speech representation and mask estimation, to enhance speech denoising and dereverberation in SSL frameworks.
Findings
Outperforms state-of-the-art methods on the NOISEX and DAPS datasets.
Effective combination of features improves speech enhancement accuracy.
Ensemble pre-tasks enhance generalization to unseen speakers.
Abstract
Self-supervised learning (SSL) has achieved great success in monaural speech enhancement, but the accuracy of target speech estimation, particularly for unseen speakers, remains inadequate with existing pre-tasks. Because a speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning a latent representation for speech enhancement is a difficult task. In this paper, we study the effectiveness of each feature commonly used in speech enhancement and exploit feature combination in the SSL setting. In addition, we propose an ensemble training strategy. The latent representation of the clean speech signal is learned while the dereverberated mask and the estimated ratio mask are exploited to denoise and dereverberate the mixture. The latent representation learning and the mask estimation are considered as two pre-tasks in the…
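To make the mask-based part of the abstract concrete, here is a minimal sketch of how a ratio mask can be computed and applied to a mixture in the time-frequency domain. It uses the common ideal ratio mask formulation, which is an assumption here: the paper's exact mask definitions (including the dereverberated mask) are not given in this summary. The function names, array shapes, and random stand-in spectrograms are hypothetical and only for illustration.

```python
import numpy as np

def ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask per time-frequency bin (assumed standard form).

    Computes sqrt(|S|^2 / (|S|^2 + |N|^2)); eps avoids division by zero.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

def apply_mask(mixture_mag, mask):
    """Element-wise masking of the mixture magnitude spectrogram."""
    return mask * mixture_mag

# Hypothetical stand-ins for magnitude spectrograms (freq bins x frames).
rng = np.random.default_rng(0)
clean_mag = rng.random((257, 100))
noise_mag = rng.random((257, 100))

# Crude assumption: mixture magnitude is the sum of the two magnitudes.
mixture_mag = clean_mag + noise_mag

mask = ratio_mask(clean_mag, noise_mag)
enhanced_mag = apply_mask(mixture_mag, mask)
```

In a learned system the mask would be predicted by a network from the mixture rather than computed from the clean and noise signals, which are unavailable at inference time. The oracle version above only shows what such a network is trained to approximate.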
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
