Boosting Self-Supervised Embeddings for Speech Enhancement
Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin

TL;DR
This paper improves SSL-based speech enhancement by combining SSL embeddings with spectrogram features, analyzes how the noise robustness of SSL representations relates to their usefulness for enhancement, and reports better results than existing SSL-based methods.
Contribution
It introduces a cross-domain feature that combines SSL embeddings with spectrograms, analyzes the noise robustness of SSL representations via the clean-noisy (CN) distance, and shows improved speech enhancement performance without complex architectures.
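To make the cross-domain feature concrete, here is a minimal sketch in plain Python. It rests on several assumptions that this summary does not confirm: that SSL layers are merged with a softmax-weighted sum, that the spectrogram is log1p-compressed, that the two feature streams are already aligned frame by frame, and that they are joined by concatenation. The function names (weighted_layer_sum, cross_domain_feature) are hypothetical and not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers, layer_logits):
    """Merge SSL layers into one embedding per frame.

    layers: list of L layers, each a list of T frames, each a list of D floats.
    layer_logits: L learnable scalars (assumed here); softmax turns them
    into layer-importance weights that sum to 1.
    """
    alphas = softmax(layer_logits)
    num_frames, dim = len(layers[0]), len(layers[0][0])
    merged = [[0.0] * dim for _ in range(num_frames)]
    for alpha, layer in zip(alphas, layers):
        for t in range(num_frames):
            for d in range(dim):
                merged[t][d] += alpha * layer[t][d]
    return merged


def cross_domain_feature(ssl_frames, spec_frames):
    """Concatenate SSL embeddings with a log1p-compressed spectrogram.

    Assumes both inputs already share the same frame rate and length.
    """
    if len(ssl_frames) != len(spec_frames):
        raise ValueError("SSL and spectrogram features need the same number of frames")
    return [
        ssl + [math.log1p(mag) for mag in spec]
        for ssl, spec in zip(ssl_frames, spec_frames)
    ]
```

In a real system the layer logits would be trained together with the enhancement model, and the spectrogram would need to be resampled or framed to match the SSL model's frame rate before concatenation.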
Findings
SSL embeddings with lower noise robustness are more important for SE.
Fine-tuning SSL with SE models outperforms state-of-the-art SSL-based SE methods.
CN distance increases after fine-tuning, indicating improved robustness.
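The CN distance mentioned above compares the representation of a clean utterance with that of its noisy counterpart. This summary does not give the paper's exact formula, so the sketch below assumes one plausible form: the mean per-frame Euclidean distance between the two embeddings, computed separately for each SSL layer. The function names cn_distance and cn_distance_per_layer are hypothetical.

```python
import math


def cn_distance(clean_frames, noisy_frames):
    """Mean per-frame L2 distance between clean and noisy embeddings.

    clean_frames, noisy_frames: lists of T frames, each a list of D floats,
    extracted from the same utterance with and without added noise.
    The L2 form is an assumption, not the paper's confirmed definition.
    """
    if len(clean_frames) != len(noisy_frames) or not clean_frames:
        raise ValueError("inputs must be non-empty and have the same number of frames")
    total = 0.0
    for clean, noisy in zip(clean_frames, noisy_frames):
        total += math.sqrt(sum((c - n) ** 2 for c, n in zip(clean, noisy)))
    return total / len(clean_frames)


def cn_distance_per_layer(clean_layers, noisy_layers):
    """Apply cn_distance to every SSL layer, giving one value per layer."""
    return [cn_distance(c, n) for c, n in zip(clean_layers, noisy_layers)]
```

Under this definition, a layer whose embeddings barely change when noise is added gets a small value, and a layer whose embeddings shift a lot gets a large one, so per-layer values can be compared against learned layer-importance weights.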
Abstract
Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representation and spectrogram, the result can be significantly boosted. We further study the relationship between the noise robustness of SSL representation via clean-noisy distance (CN distance) and the layer importance for SE. Consequently, we found that SSL representations with lower noise robustness are more important. Furthermore, our experiments on the VCTK-DEMAND dataset demonstrated that fine-tuning an SSL representation with an SE model can outperform the SOTA SSL-based SE methods in PESQ, CSIG…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hand Gesture Recognition Systems
