Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition

Chak-Fai Li; Francis Keith; William Hartmann; Matthew Snover

arXiv:2110.15836 · cs.CL · February 14, 2022 · 1 citation

TL;DR

This paper combines unsupervised pretraining with text-augmented semi-supervised learning for low-resource autoregressive speech recognition, reporting a 5% absolute reduction in word error rate (WER), averaged over all conditions, compared to semi-supervised training alone.

Contribution

It pretrains Conformer models in an unsupervised manner, combines this with semi-supervised training, and incorporates additional text data through external language models, targeting low-resource conversational and broadcast speech recognition.

Findings

01

Unsupervised pretraining outperforms traditional semi-supervised training.

02

Combining unsupervised pretraining with semi-supervised training yields a 5% absolute WER improvement, averaged over all conditions, compared to semi-supervised training alone.

03

CTC-based decoding makes better use of the additional text data incorporated through external language models.
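
The findings above are stated in terms of absolute WER improvement. As a minimal sketch of what that means, the code below computes WER from a word-level edit distance and contrasts absolute with relative reduction. The function names are hypothetical, and the 40% and 35% WER values are illustrative assumptions only, not figures from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    baseline, combined = 0.40, 0.35
    print(f"absolute: {absolute_reduction(baseline, combined):.3f}")
    print(f"relative: {relative_reduction(baseline, combined):.3f}")
```

Under these assumed values, a 5-point absolute reduction corresponds to a larger relative reduction, which is why the two should not be confused when comparing systems with different baselines.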

Abstract

Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation in low-resource (in terms of both data and compute) conversational and broadcast domains. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the techniques are complementary. Combining the techniques yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models. By using CTC-based decoding, we are better able to take advantage of the additional text data. When used as a transcription model, it allows the Conformer model to better incorporate the knowledge from the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing