SOA: Reducing Domain Mismatch in SSL Pipeline by Speech Only Adaptation for Low Resource ASR

Natarajan Balaji Shankar, Ruchao Fan, and Abeer Alwan

arXiv:2406.10512·eess.AS·June 18, 2024

Open Access

TL;DR

This paper proposes Speech Only Adaptation (SOA), a simple method for domain adaptation of speech models that improves performance on target domains using only speech data, without retraining on labeled data.

Contribution

The paper introduces SOA, a novel speech-only adaptation technique for Wav2vec 2.0 that enhances domain transfer in low-resource ASR scenarios without additional labeled data.

Findings

01 Significant WER improvements on target domains

02 Preserves source domain performance

03 Effective in low-resource and domain mismatch settings
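For readers unfamiliar with the metric behind the first finding, word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function name `word_error_rate` is chosen here for illustration, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.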

Abstract

Recently, speech foundation models have gained popularity because of their strong performance when finetuned on downstream ASR tasks. However, models finetuned on certain domains, such as LibriSpeech (adult read speech), perform poorly on other domains (child or noisy speech). One solution is to collect as much labeled and diverse data as possible for joint finetuning on various domains. However, collecting target-domain speech-text paired data and retraining the model are often costly and computationally expensive. In this paper, we introduce a simple yet effective method, speech only adaptation (SOA), based on speech foundation models (Wav2vec 2.0), which requires only speech input data from the target domain. Specifically, the Wav2vec 2.0 feature encoder is continually pretrained with the Wav2vec 2.0 loss on both the source and target domain data for domain adaptation, while the contextual…
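The two ingredients the abstract does state, that only the feature encoder is updated and that the adaptation data is unlabeled speech from both domains, can be sketched as follows. This is an illustrative outline, not the authors' implementation. The parameter-name prefix `feature_encoder.`, the example parameter names, and both helper functions are hypothetical. The abstract is cut off before it says how the rest of the model is handled, so the sketch only selects what gets updated and makes no claim about the remaining parameters.

```python
import random

# Hypothetical prefix for the convolutional feature encoder's parameters;
# real checkpoints use their own naming scheme.
FEATURE_ENCODER_PREFIX = "feature_encoder."


def select_adapted_parameters(param_names):
    """Return the parameter names that would be updated during adaptation."""
    return [name for name in param_names if name.startswith(FEATURE_ENCODER_PREFIX)]


def build_adaptation_pool(source_audio, target_audio, seed=0):
    """Mix unlabeled source- and target-domain utterances into one shuffled pool.

    Each item is a (domain, audio_id) pair; no transcripts are involved.
    """
    pool = [("source", a) for a in source_audio] + [("target", a) for a in target_audio]
    random.Random(seed).shuffle(pool)
    return pool


# Usage with made-up names:
param_names = [
    "feature_encoder.conv0.weight",
    "context_network.layer0.attention.weight",
    "ctc_head.weight",
]
adapted = select_adapted_parameters(param_names)
pool = build_adaptation_pool(["src_utt_1", "src_utt_2"], ["tgt_utt_1"])
```

In a real pipeline, the selected names would determine which tensors receive gradients from the self-supervised loss, and the pool would feed the pretraining data loader.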

Taxonomy

Topics: Speech Recognition and Synthesis