Large scale weakly and semi-supervised learning for low-resource video ASR
Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

TL;DR
This paper compares semi-supervised and weakly-supervised methods for low-resource social media video speech recognition, demonstrating significant WER improvements through large-scale experiments on Dutch and Romanian datasets.
Contribution
It provides a comprehensive large-scale comparison of self-labeling and metadata-based pretraining methods for low-resource speech recognition.
Findings
Sequence-level distillation for encoder-decoder models yields the largest relative WER reduction, 20%.
All approaches improve on their respective baseline WERs by more than 8%.
Encoder-decoder models benefit most from sequence-level distillation.
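The findings above are stated in terms of WER and relative WER reduction. As a minimal sketch of how those two quantities are usually computed (the function names and the example numbers are illustrative assumptions, not taken from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers: a baseline at 25.0% WER improved to 20.0% WER
    # is a 20% relative reduction (and a 5-point absolute reduction).
    print(relative_wer_reduction(25.0, 20.0))
```

Under this definition, a 20% relative reduction means the new WER is 0.8 times the baseline WER, which is different from a drop of 20 absolute percentage points.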
Abstract
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented…
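The abstract distinguishes distillation at the frame level from distillation at the sequence level. The sketch below is a generic illustration of that distinction, not the paper's training recipe: it assumes frame-level distillation means matching the teacher's per-frame posteriors with a cross-entropy loss, and that sequence-level distillation means training the student on the teacher's decoded transcripts. All names (`frame_level_distillation_loss`, `sequence_level_targets`, `teacher_decode`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_distillation_loss(teacher_logits, student_logits):
    """Mean per-frame cross-entropy of the student against teacher posteriors.

    Both arguments are lists of frames; each frame is a list of logits over
    the same output units (an assumption of this sketch).
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same frames")
    loss = 0.0
    for teacher_frame, student_frame in zip(teacher_logits, student_logits):
        p = softmax(teacher_frame)  # soft targets from the teacher
        q = softmax(student_frame)  # student predictions
        loss -= sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return loss / len(teacher_logits)


def sequence_level_targets(teacher_decode, unlabeled_utterances):
    """Pair each unlabeled utterance with the teacher's decoded transcript.

    `teacher_decode` is a hypothetical callable mapping an utterance to a
    transcript; the resulting pairs would be used as ordinary training data.
    """
    return [(utt, teacher_decode(utt)) for utt in unlabeled_utterances]
```

The frame-level loss needs the student and teacher to share a frame rate and output units, whereas the sequence-level variant only needs transcripts, which is one reason it applies naturally to encoder-decoder models.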
