Boosting Self-Supervised Embeddings for Speech Enhancement

Kuo-Hsuan Hung; Szu-wei Fu; Huan-Hsin Tseng; Hsin-Tien Chiang; Yu Tsao; Chii-Wann Lin

arXiv:2204.03339·eess.AS·July 6, 2022·Interspeech

TL;DR

This paper improves speech enhancement (SE) by combining self-supervised learning (SSL) embeddings with spectrograms, analyzes how the noise robustness of SSL representations relates to their importance for SE, and reports better results than existing SSL-based SE methods.

Contribution

It introduces a cross-domain feature that combines SSL embeddings with spectrograms, analyzes noise robustness via the clean-noisy (CN) distance, and shows improved speech enhancement performance without complex architectures.
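
One way to picture the cross-domain feature is as a frame-wise concatenation of an SSL embedding and a spectrogram frame. The sketch below is illustrative only and is not taken from the authors' repository: the softmax-weighted sum over SSL layers, the function names (`weighted_layer_sum`, `cross_domain_feature`), and the assumption that both inputs are already aligned to the same number of frames are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers, layer_logits):
    """Collapse L SSL layers into one embedding per frame.

    layers: list of L layers, each a list of T frames, each a list of D floats.
    layer_logits: list of L floats (hypothetical learnable layer weights).
    """
    alphas = softmax(layer_logits)
    n_frames = len(layers[0])
    dim = len(layers[0][0])
    return [
        [sum(a * layer[t][d] for a, layer in zip(alphas, layers)) for d in range(dim)]
        for t in range(n_frames)
    ]


def cross_domain_feature(layers, layer_logits, spectrogram):
    """Concatenate the SSL embedding and the spectrogram frame by frame.

    Assumes the spectrogram has already been aligned to the SSL frame rate.
    """
    ssl_frames = weighted_layer_sum(layers, layer_logits)
    if len(ssl_frames) != len(spectrogram):
        raise ValueError("SSL and spectrogram frame counts must match")
    return [ssl + spec for ssl, spec in zip(ssl_frames, spectrogram)]
```

With two one-frame layers `[[1.0, 2.0]]` and `[[3.0, 4.0]]`, equal layer logits, and a one-bin spectrogram frame `[0.5]`, the result is a single three-dimensional frame whose first two entries are the layer average and whose last entry is the spectrogram bin.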

Findings

01

SSL embeddings with lower noise robustness are more important for SE.

02

Fine-tuning an SSL representation with an SE model outperforms state-of-the-art SSL-based SE methods on VCTK-DEMAND.

03

CN distance increases after fine-tuning, indicating improved robustness.
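
The CN distance compares the SSL representation of a clean utterance with that of its noisy counterpart. The page does not give the exact formula, so the following is only a minimal sketch under stated assumptions: it uses the mean per-frame Euclidean distance, and the helper names `cn_distance` and `cn_distance_per_layer` are hypothetical.

```python
import math


def cn_distance(clean_frames, noisy_frames):
    """Mean per-frame Euclidean distance between clean and noisy embeddings.

    Both arguments are lists of T frames, each a list of D floats.
    The Euclidean metric is an assumption, not the paper's stated definition.
    """
    if len(clean_frames) != len(noisy_frames):
        raise ValueError("clean and noisy inputs must have the same frame count")
    total = 0.0
    for clean, noisy in zip(clean_frames, noisy_frames):
        total += math.sqrt(sum((c - n) ** 2 for c, n in zip(clean, noisy)))
    return total / len(clean_frames)


def cn_distance_per_layer(clean_layers, noisy_layers):
    """Apply cn_distance to each SSL layer, so layers can be compared."""
    return [cn_distance(c, n) for c, n in zip(clean_layers, noisy_layers)]
```

Under this definition, a larger value means the layer's representation moves more when noise is added. Computing it per layer gives one number per layer that can be set against that layer's importance for SE.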

Abstract

Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representation and spectrogram, the result can be significantly boosted. We further study the relationship between the noise robustness of SSL representation via clean-noisy distance (CN distance) and the layer importance for SE. Consequently, we found that SSL representations with lower noise robustness are more important. Furthermore, our experiments on the VCTK-DEMAND dataset demonstrated that fine-tuning an SSL representation with an SE model can outperform the SOTA SSL-based SE methods in PESQ, CSIG…

Code & Models

Repositories

khhungg/BSSE-SE
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hand Gesture Recognition Systems