Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

Yiming Wang; Jinyu Li; Heming Wang; Yao Qian; Chengyi Wang; Yu Wu

arXiv:2110.04934·cs.CL·January 27, 2022·6 cites

TL;DR

Wav2vec-Switch introduces a contrastive learning approach that enhances noise robustness in speech representations by training on original-noisy speech pairs, improving speech recognition accuracy in noisy conditions.

Contribution

It proposes a novel contrastive learning method that enforces consistent predictions between original and noisy speech, significantly improving noise robustness in speech recognition models.

Findings

01. Achieves 2.9-4.9% relative WER reduction on synthesized noisy data.

02. Attains 5.7% relative WER reduction on real noisy CHiME-4 data.

03. Outperforms baseline data augmentation and matches or surpasses speech enhancement methods.
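
The findings above are stated as relative WER reductions, meaning the fraction of the baseline's word error rate that is removed, not an absolute drop in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_wer_reduction` and the example numbers are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline WER of 10.0% improved to 9.5%.
example = relative_wer_reduction(10.0, 9.5)
```

Under this definition, a small absolute change on a low baseline WER can still correspond to a sizeable relative reduction, which is why the two should not be compared directly.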

Abstract

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper, we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. Doing so forces the network to make consistent predictions for the original and noisy speech, allowing it to learn contextualized representations that are robust to noise.…
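
The target-switching idea in the abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` value, and the `switch_weight` factor are hypothetical, and every other frame in the utterance is used as a distractor instead of sampling distractors from masked positions as wav2vec 2.0 does.

```python
import numpy as np


def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss over T frames.

    context, targets: arrays of shape (T, D). Frame t of `context` must
    identify frame t of `targets`; the other frames serve as distractors
    (a simplification, see the lead-in).
    """
    # Cosine similarity between every context frame and every target frame.
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ q.T) / temperature  # shape (T, T)
    # Numerically stable log-softmax over the candidate targets.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for frame t sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def switch_loss(c_orig, q_orig, c_noisy, q_noisy, switch_weight=1.0):
    """Standard contrastive terms plus the switched-target terms.

    `switch_weight` is an assumed knob for balancing the two groups.
    """
    standard = contrastive_loss(c_orig, q_orig) + contrastive_loss(c_noisy, q_noisy)
    # Switched targets: original context predicts noisy targets and vice versa.
    switched = contrastive_loss(c_orig, q_noisy) + contrastive_loss(c_noisy, q_orig)
    return standard + switch_weight * switched


# Toy usage with random stand-ins for network outputs (8 frames, 16 dims).
rng = np.random.default_rng(0)
c_o, q_o = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
c_n, q_n = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
total = switch_loss(c_o, q_o, c_n, q_n)
```

The switched terms are small only when the context vectors of one input also match the quantized targets of the other, which is how the extra terms push the original and noisy representations toward agreement.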

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Contrastive Learning