Self-Supervised Learning based Monaural Speech Enhancement with Multi-Task Pre-Training
Yi Li, Yang Sun, Syed Mohsen Naqvi

TL;DR
This paper introduces a multi-task pre-training approach for self-supervised monaural speech enhancement, leveraging limited clean speech data and multiple pre-tasks to improve denoising performance on reverberant mixtures.
Contribution
It proposes a novel multi-task pre-training framework combining a pre-training autoencoder and a downstream autoencoder for enhanced speech denoising.
Findings
Outperforms state-of-the-art speech enhancement methods
Effective with limited clean speech data
Improves denoising on unseen reverberant mixtures
Abstract
In self-supervised learning, it is challenging to reduce the gap between the enhancement performance on the estimated and target speech signals with existing pre-tasks. In this paper, we propose a multi-task pre-training method to improve speech enhancement performance with self-supervised learning. Within the pre-training autoencoder (PAE), only a limited set of clean speech signals is required to learn their latent representations. Meanwhile, to overcome the limitation of a single pre-task, the proposed masking module exploits the dereverberated mask and estimated ratio mask to denoise the mixture as the second pre-task. Unlike the PAE, where the target speech signals are estimated, the downstream task autoencoder (DAE) utilizes a large number of unlabeled and unseen reverberant mixtures to generate the estimated mixtures. The trained DAE is shared by the learned…
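The abstract names a ratio mask but does not define it. As a point of reference only, the following is a minimal sketch of a conventional per-bin ratio mask applied to a mixture magnitude spectrogram. The function names (`ratio_mask`, `apply_mask`), the additive-magnitude mixture, and the toy values are assumptions made for illustration; this is not the authors' masking module, and the dereverberated mask is not modeled here.

```python
import math

def ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Conventional ratio mask per time-frequency bin: sqrt(S^2 / (S^2 + N^2)).

    Inputs are frames x bins lists of non-negative magnitudes.
    eps guards against division by zero when both magnitudes are 0.
    """
    return [
        [math.sqrt(s * s / (s * s + n * n + eps)) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_mag, noise_mag)
    ]

def apply_mask(mixture_mag, mask):
    """Element-wise product of the mixture magnitudes and the mask."""
    return [
        [m * g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(mixture_mag, mask)
    ]

# Toy 2-frame x 3-bin magnitude spectrograms (made-up values).
speech = [[3.0, 0.0, 1.0], [2.0, 2.0, 0.0]]
noise = [[4.0, 1.0, 1.0], [0.0, 2.0, 5.0]]

# Crude assumption: mixture magnitude is the sum of speech and noise magnitudes.
mixture = [[s + n for s, n in zip(s_row, n_row)] for s_row, n_row in zip(speech, noise)]

mask = ratio_mask(speech, noise)
enhanced = apply_mask(mixture, mask)
```

Bins dominated by speech get a mask value near 1 and pass through almost unchanged, while bins dominated by noise are attenuated toward 0. In a learned system the mask would be estimated by the network from the mixture alone, since clean speech and noise are not available at inference time.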
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies
