Improving low-resource ASR performance with untranscribed out-of-domain data
Jayadev Billa

TL;DR
This paper shows that, for low-resource ASR, untranscribed out-of-domain web data significantly improves recognition accuracy when it is used in a two-stage schedule: pre-train on the out-of-domain data, then fine-tune.
Contribution
It introduces a simple yet effective semi-supervised training method that leverages out-of-domain web data for low-resource ASR, showing consistent WER improvements across multiple languages.
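The difference between the two-stage recipe and simply pooling the data can be sketched as a training plan. This is a minimal illustration, not the paper's implementation: the function names, the toy utterance IDs, and the `decode` callable are all hypothetical, and it assumes the usual SST setup in which a seed recognizer produces pseudo-transcripts for the untranscribed audio and fine-tuning uses the transcribed in-domain training set.

```python
from typing import Callable, List, Tuple

Utterance = Tuple[str, str]          # (audio_id, transcript)
Stage = Tuple[str, List[Utterance]]  # (stage name, data used in that stage)


def pseudo_label(audio_ids: List[str], decode: Callable[[str], str]) -> List[Utterance]:
    """Transcribe untranscribed audio with an existing (seed) recognizer."""
    return [(audio_id, decode(audio_id)) for audio_id in audio_ids]


def pooled_plan(in_domain: List[Utterance], pseudo: List[Utterance]) -> List[Stage]:
    """Single stage: train once on the union of both sets."""
    return [("train", pseudo + in_domain)]


def two_stage_plan(in_domain: List[Utterance], pseudo: List[Utterance]) -> List[Stage]:
    """Stage 1 on out-of-domain pseudo-labels, stage 2 on in-domain data (assumed)."""
    return [("pretrain", pseudo), ("finetune", in_domain)]


# Toy placeholders, not real data.
in_domain = [("call_001", "hello how are you")]
web_audio = ["yt_001", "yt_002"]
pseudo = pseudo_label(web_audio, decode=lambda audio_id: f"<hypothesis for {audio_id}>")

for name, data in two_stage_plan(in_domain, pseudo):
    print(name, len(data))
```

In the two-stage plan the in-domain data is the last thing the model sees, whereas in the pooled plan the mismatched pseudo-labeled data is mixed into every update.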
Findings
Up to 16.3% relative WER reduction over baseline
Training first on out-of-domain data and then fine-tuning improves WER in all cases reported
Simply pooling out-of-domain data with the training data lowers WER only in some cases
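For readers unfamiliar with the metric, a relative WER reduction compares the change in WER to the baseline WER rather than reporting the absolute difference. The sketch below uses hypothetical WER values chosen only for illustration; they are not results from the paper, and the function name is an assumption.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent; positive means the new system is better."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers (not from the paper): a drop from 50.0% to 45.0% WER
# is 5 points absolute, which is a 10% relative reduction.
print(relative_wer_reduction(50.0, 45.0))  # 10.0
```

A negative value means the new system is worse than the baseline, which is the outcome the findings above warn about for naive pooling.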
Abstract
Semi-supervised training (SST) is a common approach to leverage untranscribed/unlabeled speech data to improve automatic speech recognition performance in low-resource languages. However, if the available unlabeled speech is mismatched to the target domain, SST is not as effective, and in many cases performs worse than the original system. In this paper, we address the issue of low-resource ASR when only untranscribed out-of-domain speech data is readily available in the target language. Specifically, we look to improve performance on conversational/telephony speech (target domain) using web resources, in particular YouTube data, which more closely resembles news/topical broadcast data. Leveraging SST, we show that while in some cases simply pooling the out-of-domain data with the training data lowers word error rate (WER), in all cases, we see improvements if we train first with the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
