Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training

Mitchell DeHaven; Jayadev Billa

arXiv:2207.00659 · cs.CL · July 5, 2022 · 5 cites

Open Access

TL;DR

This paper compares continued pretraining and semi-supervised training for low-resource speech recognition. It shows that continued pretraining is more computationally efficient, achieves comparable or better word error rates, and yields further gains when the continued-pretrained model is used for pseudo-labeling.

Contribution

The paper investigates continued pretraining of the XLSR-53 model on unlabeled in-language audio as an efficient alternative to semi-supervised training for low-resource ASR, with empirical evidence of its effectiveness.

Findings

01. Continued pretraining (CoPT) achieves word error rates (WER) similar to or better than semi-supervised training (SST) in low-resource languages.

02. Using the CoPT model for pseudo-labeling further improves WER.

03. CoPT is more computationally efficient than SST.
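
The findings above are all stated in terms of word error rate. As a reminder of how that metric is commonly computed, here is a minimal sketch, not the authors' scoring code: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `wer` and the example sentences are illustrative assumptions, and real evaluations usually apply text normalization first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name), using whitespace tokenization only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {score:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.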

Abstract

Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR). This is evident in the performance of the wav2vec 2.0-based pretrained XLSR-53 model across many languages when fine-tuned with available labeled data. However, the performance obtained by fine-tuning these models can depend on the amount of in-language or similar-to-in-language data included in the pretraining dataset. In this paper, we investigate continued pretraining (CoPT) with unlabeled in-language audio data on the XLSR-53 pretrained model in several low-resource languages. CoPT is more computationally efficient than semi-supervised training (SST), the standard approach to utilizing unlabeled data in ASR, since it omits the need for pseudo-labeling of the unlabeled data. We show CoPT results in word error…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing