Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition
Chak-Fai Li, Francis Keith, William Hartmann, Matthew Snover

TL;DR
This paper combines unsupervised pretraining with text-augmented semi-supervised learning to improve low-resource autoregressive speech recognition, reporting a 5% absolute reduction in word error rate (WER), averaged over all conditions, compared to semi-supervised training alone.
Contribution
It introduces a novel approach that integrates unsupervised pretraining with semi-supervised learning and external language models for low-resource speech recognition.
Findings
Unsupervised pretraining outperforms traditional semi-supervised training.
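The two methods are complementary: combining them yields a 5% absolute WER improvement, averaged over all conditions, compared to semi-supervised training alone.
CTC-based decoding makes better use of the additional text data, which is incorporated through external language models.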
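Because the findings are stated as an absolute WER improvement, it may help to see how WER is typically computed and how an absolute change differs from a relative one. The following is a minimal sketch, not the authors' scoring tool: the function names are hypothetical, and the numbers in the usage note are made up for illustration and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, e.g. 0.40 -> 0.35 is 0.05 (5% absolute)."""
    return baseline_wer - new_wer


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

With a hypothetical baseline WER of 0.40 and a new WER of 0.35, `absolute_improvement` gives 0.05 (5% absolute) while `relative_improvement` gives 0.125 (12.5% relative), so the two figures should not be conflated when comparing systems.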
Abstract
Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation in low-resource -- both in terms of data and compute -- conversational and broadcast domains. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the techniques are complementary. Combining the techniques yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models. By using CTC-based decoding, we are better able to take advantage of the additional text data. When used as a transcription model, it allows the Conformer model to better incorporate the knowledge from the…
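The abstract does not spell out the decoding procedure, so the following is only a generic sketch of the two ideas it mentions: CTC decoding and the use of an external language model. It shows greedy CTC decoding (take the best label per frame, merge repeats, drop blanks) and a simple log-linear rescoring of candidate transcripts. The blank index, the function names, the `lm_weight` parameter, and all scores are assumptions for illustration, not details from the paper.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Pick the best label per frame, merge repeats, then remove blanks."""
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    decoded: List[int] = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def rescore_with_lm(hypotheses: Sequence[Tuple[str, float, float]],
                    lm_weight: float) -> str:
    """Return the transcript maximizing acoustic + lm_weight * LM log-score.

    Each hypothesis is (text, acoustic_log_prob, lm_log_prob).
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])
    return best[0]
```

In this sketch, a per-frame best path of `1, 1, blank, 1, 2, 2, blank` decodes to `[1, 1, 2]`: the blank between the two runs of label 1 is what lets a repeated label survive. Raising `lm_weight` shifts the choice toward hypotheses the language model prefers, which is one common way text-only data can influence recognition output.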
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
