Investigating self-supervised learning for speech enhancement and separation
Zili Huang, Shinji Watanabe, Shu-wen Yang, Paola Garcia, Sanjeev Khudanpur

TL;DR
This paper evaluates 13 self-supervised learning (SSL) methods on speech enhancement and separation, shows that some outperform traditional features such as STFT magnitude and FBANK, and analyzes the challenges and representation properties that matter when applying SSL to these tasks.
Contribution
It provides a comprehensive evaluation of SSL methods on speech enhancement and separation, highlighting their effectiveness and analyzing key factors affecting their application.
Findings
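Some SSL representations consistently outperform baseline features such as STFT magnitude and FBANK on Voicebank-DEMAND and Libri2Mix.
The paper analyzes the factors that make existing SSL frameworks difficult to apply to speech enhancement and separation.
It discusses which representation properties are desirable for enhancement and separation.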
Abstract
Speech enhancement and separation are two fundamental tasks for robust speech processing. Speech enhancement suppresses background noise while speech separation extracts target speech from interfering speakers. Despite a great number of supervised learning-based enhancement and separation methods having been proposed and achieving good performance, studies on applying self-supervised learning (SSL) to enhancement and separation are limited. In this paper, we evaluate 13 SSL upstream methods on speech enhancement and separation downstream tasks. Our experimental results on Voicebank-DEMAND and Libri2Mix show that some SSL representations consistently outperform baseline features including the short-time Fourier transform (STFT) magnitude and log Mel filterbank (FBANK). Furthermore, we analyze the factors that make existing SSL frameworks difficult to apply to speech enhancement and…
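The two baseline features named in the abstract, STFT magnitude and log Mel filterbank (FBANK), can be illustrated with a minimal NumPy sketch. This is not the paper's feature pipeline: the 512-point FFT, 160-sample hop, 40 Mel bands, 16 kHz sample rate, Hann window, and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(wave, n_fft=512, hop=160):
    """Magnitude spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1).

    Frame length, hop size, and the Hann window are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges are equally spaced on the Mel scale from 0 Hz to Nyquist.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m : m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Each filter is a triangle peaking at 1 on its center frequency.
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_fbank(wave, n_mels=40, n_fft=512, hop=160, sample_rate=16000, eps=1e-10):
    """Log Mel filterbank energies, shape (frames, n_mels)."""
    power = stft_magnitude(wave, n_fft, hop) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + eps)  # eps avoids log(0) on silent frames
```

Calling `stft_magnitude(wave)` or `log_mel_fbank(wave)` on a mono waveform stored as a NumPy array yields one feature vector per frame, which is the kind of input an SSL representation would replace in a downstream enhancement or separation model.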
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
