Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

Xiao-Ying Zhao; Qiu-Shi Zhu; Jie Zhang

arXiv:2209.14150·eess.AS·September 29, 2022

TL;DR

This paper introduces a speech enhancement method leveraging self-supervised pre-trained models and vector quantization, improving real-time denoising performance by discretizing speech representations and adapting model architecture.

Contribution

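The paper initializes the encoder and bottleneck layer of a modified DEMUCS model from the self-supervised pre-trained WavLM model, replaces the encoder convolutions with causal convolutions, applies a causal attention mask in the bottleneck transformer encoder, and adds a vector quantization module that discretizes the bottleneck output before decoding.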
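The two causality changes mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `causal_conv1d`, `causal_attention_mask`, and `masked_softmax` are hypothetical, and the single-channel, list-based layout is an assumption chosen only to keep the example self-contained.

```python
import math


def causal_conv1d(signal, kernel):
    """1-D convolution where output[t] depends only on signal[0..t].

    kernel[0] weights the current sample and kernel[k] the sample k steps back.
    Left zero-padding by len(kernel) - 1 keeps the output length unchanged.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    out = []
    for t in range(len(signal)):
        # padded[t + pad] is signal[t]; step backwards for older samples.
        out.append(sum(kernel[k] * padded[t + pad - k] for k in range(len(kernel))))
    return out


def causal_attention_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j, i.e. j <= i."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    peak = max(s for s, keep in zip(scores, mask_row) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Two-tap moving average that never looks ahead.
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # → [0.5, 1.5, 2.5, 3.5]

# Frame 1 of 3 may attend to frames 0 and 1 only; the large score at
# frame 2 (a future frame) is ignored.
mask = causal_attention_mask(3)
print(masked_softmax([0.0, 0.0, 5.0], mask[1]))  # → [0.5, 0.5, 0.0]
```

Because neither operation reads future frames, the output at time t can be produced as soon as frame t arrives, which is the property a real-time enhancement model needs.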

Findings

01 Pre-trained model initialization improves speech enhancement performance.

02 Vector quantization suppresses noise in speech representations.

03 The method outperforms baseline models on the Valentini dataset and on internal datasets.

Abstract

With the development of deep learning, neural network-based speech enhancement (SE) models have shown excellent performance. Meanwhile, self-supervised pre-trained models have been shown to transfer well to various downstream tasks. In this paper, we consider the application of a pre-trained model to the real-time SE problem. Specifically, the encoder and bottleneck layer of the DEMUCS model are initialized using the self-supervised pre-trained WavLM model, the convolution in the encoder is replaced by causal convolution, and the transformer encoder in the bottleneck layer uses a causal attention mask. In addition, as discretizing the noisy speech representations is more beneficial for denoising, we utilize a quantization module to discretize the representation output from the bottleneck layer, which is then fed into the decoder to reconstruct the clean…
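The discretization step can be pictured with a generic nearest-neighbour vector quantizer. This is only a sketch under stated assumptions: the excerpt does not describe the paper's quantization module in detail, so the codebook values, the helper names `squared_distance` and `quantize`, and the use of Euclidean distance are all hypothetical, and codebook training is omitted entirely.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each frame by its nearest codebook entry.

    Returns (indices, quantized_frames). The codebook is assumed to be
    given; in a trained model it would be learned jointly with the network.
    """
    indices = []
    quantized = []
    for frame in frames:
        # Pick the code with the smallest distance to this frame.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(frame, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical 2-entry codebook of 2-dimensional representations.
codebook = [[0.0, 0.0], [1.0, 1.0]]

# Frames 0 and 2 are differently perturbed versions of the same
# underlying vector; both collapse onto code 0.
frames = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.1]]
indices, quantized = quantize(frames, codebook)
print(indices)    # → [0, 1, 0]
print(quantized)  # → [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
```

The example shows the intuition behind the denoising claim: small perturbations of a representation map to the same discrete code, so the decoder sees an identical input regardless of that perturbation.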

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies

Methods: Convolution