Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

Heming Wang; Yao Qian; Xiaofei Wang; Yiming Wang; Chengyi Wang; Shujie Liu; Takuya Yoshioka; Jinyu Li; DeLiang Wang

arXiv:2110.15430 · cs.SD · November 1, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a noise-robust speech representation learning method combining contrastive learning with a reconstruction module, significantly improving ASR performance in noisy environments without requiring denoising during inference.

Contribution

The paper proposes a multi-task continual pre-training framework that pairs a reconstruction module with contrastive learning to make the learned speech representations more robust to noise.

Findings

01 Reduces word error rate (WER) by around 4.1/7.5% on noisy LibriSpeech test sets

02 Achieves state-of-the-art performance on CHiME-4 noisy speech recognition

03 Performs comparably to supervised methods with only 16% of the labeled data
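
The findings above are stated in terms of WER (word error rate). As a reminder of how that metric is usually defined, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the scoring tool the authors used is not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: tokenizes on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the hat sat")` counts one substitution over three reference words.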

Abstract

Noise robustness is essential for deploying automatic speech recognition (ASR) systems in real-world environments. One way to reduce the effect of noise interference is to employ a preprocessing module that conducts speech enhancement, and then feed the enhanced speech to an ASR backend. In this work, instead of suppressing background noise with a conventional cascaded pipeline, we employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We propose to combine a reconstruction module with contrastive learning and perform multi-task continual pre-training on noisy data. The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference. Experiments demonstrate the effectiveness of our proposed method. Our model substantially reduces the…
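
The multi-task objective described in the abstract can be sketched as a contrastive term plus a weighted reconstruction term. This is only an illustration under stated assumptions: the InfoNCE-style contrastive form, the mean-absolute-error reconstruction term, the `temperature` and `weight` parameters, and every function name below are hypothetical choices, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss (assumed form): pick the true target among distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


def reconstruction_loss(predicted, clean):
    """Mean absolute error (assumed form) between reconstructed and clean features."""
    return sum(abs(p - c) for p, c in zip(predicted, clean)) / len(clean)


def multitask_loss(context, target, distractors, predicted, clean, weight=1.0):
    """Contrastive term plus a weighted auxiliary reconstruction term.

    `weight` is a hypothetical hyperparameter balancing the two tasks.
    """
    return (contrastive_loss(context, target, distractors)
            + weight * reconstruction_loss(predicted, clean))
```

Because the reconstruction term only shapes the representation during pre-training, a sketch like this would drop the reconstruction branch at inference time and keep only the encoder that produces `context`, which matches the abstract's statement that the module is not required during inference.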

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Test · Contrastive Learning