Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition
Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

TL;DR
Wav2vec-Switch introduces a contrastive learning approach that enhances noise robustness in speech representations by training on original-noisy speech pairs, improving speech recognition accuracy in noisy conditions.
Contribution
The paper proposes a contrastive learning method that enforces consistent predictions between original and noisy speech by using the quantized representations of each as additional prediction targets for the other, which improves the noise robustness of speech recognition models.
Findings
Achieves a 2.9-4.9% relative word error rate (WER) reduction on synthesized noisy data.
Attains a 5.7% relative WER reduction on real noisy CHiME-4 data.
Outperforms baseline data augmentation and matches or surpasses speech enhancement methods.
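For readers unfamiliar with the metric, a relative WER reduction is measured against the baseline's own error rate rather than in absolute percentage points. The definition below is the standard one; the symbol names are chosen here for illustration and do not come from the paper.

```latex
\[
\text{relative WER reduction}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{baseline}}} \times 100\%
\]
```

As a purely hypothetical illustration (these are not numbers from the paper), a drop from a 10.0% WER to a 9.5% WER is a 0.5-point absolute reduction and a 5% relative reduction.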
Abstract
The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper, we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. This forces the network to make consistent predictions for the original and noisy speech, and thus allows it to learn contextualized representations with noise robustness.…
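The target switching described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity scoring, the `temperature` value, the weight `lam` on the switched terms, and the choice to share distractor indices between the two views are all assumptions made here, and masking, the quantizer, and any auxiliary losses are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(context, targets, t, distractor_idx, temperature=0.1):
    """Negative log-softmax score of the true target at step t.

    context[t] should be closer to targets[t] than to the distractor
    targets at the given indices.
    """
    candidates = [t] + list(distractor_idx)
    logits = [cosine(context[t], targets[k]) / temperature for k in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def switch_loss(c_orig, q_orig, c_noisy, q_noisy, distractors, lam=1.0,
                temperature=0.1):
    """Standard contrastive terms plus the switched-target terms.

    c_* are contextualized vectors and q_* are quantized target vectors,
    one per time step. `lam` (hypothetical name) weights the switched terms.
    """
    def mean_loss(context, targets):
        steps = range(len(context))
        total = sum(
            contrastive_loss(context, targets, t, distractors[t], temperature)
            for t in steps
        )
        return total / len(context)

    # Each view predicts its own quantized targets.
    standard = mean_loss(c_orig, q_orig) + mean_loss(c_noisy, q_noisy)
    # Switched: each view also predicts the other view's quantized targets.
    switched = mean_loss(c_orig, q_noisy) + mean_loss(c_noisy, q_orig)
    return standard + lam * switched


# Toy two-step example with made-up vectors (not real speech features).
c = [[1.0, 0.0], [0.0, 1.0]]
q = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[1], [0]]  # distractor indices for each time step
loss_value = switch_loss(c, q, c, q, negatives)
```

In this sketch, the switched terms are small only when the noisy view's contextual vectors also match the original view's targets and vice versa, which is one way to read the "consistent predictions" requirement in the abstract.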
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Contrastive Learning
