Large scale weakly and semi-supervised learning for low-resource video ASR

Kritika Singh; Vimal Manohar; Alex Xiao; Sergey Edunov; Ross Girshick; Vitaliy Liptchinsky; Christian Fuegen; Yatharth Saraf; Geoffrey Zweig; Abdelrahman Mohamed

arXiv:2005.07850·eess.AS·August 10, 2020

TL;DR

This paper compares semi-supervised and weakly-supervised methods for low-resource speech recognition on social media videos. In large-scale experiments on Dutch and Romanian, every method improves on its baseline WER by more than 8%.

Contribution

It provides a comprehensive large-scale comparison of self-labeling and metadata-based pretraining methods for low-resource speech recognition.

Findings

01

Sequence-level distillation for encoder-decoder models yields the largest relative WER reduction, 20%.

02

All methods improve on their respective baseline WERs by more than 8%.

03

Encoder-decoder models benefit most from sequence-level distillation.
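
The findings above are stated as percentage changes in word error rate (WER). As a reading aid, here is a minimal sketch of how WER and a relative WER reduction are usually computed. The function names and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion against 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Hypothetical WER values, chosen only to illustrate the arithmetic.
example_reduction = relative_wer_reduction(25.0, 20.0)
```

With these made-up numbers, a drop from 25.0 to 20.0 WER is a 20% relative reduction but only a 5-point absolute one, which is why the two kinds of figure should not be mixed.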

Abstract

Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented…
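
The abstract distinguishes frame-level from sequence-level distillation without giving details here. The toy sketch below shows one common reading of the two terms, and it is an assumption rather than the authors' implementation. At the frame level the student matches the teacher's per-frame soft posteriors. At the sequence level the student is trained on a hard label sequence produced by the teacher. All function names are hypothetical, and a per-frame argmax stands in for the real decoder that would normally produce the teacher's transcript.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_loss(teacher_logits, student_logits):
    """Mean per-frame cross-entropy of the student against teacher soft posteriors."""
    total = 0.0
    for teacher_frame, student_frame in zip(teacher_logits, student_logits):
        p = softmax(teacher_frame)
        q = softmax(student_frame)
        total += -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return total / len(teacher_logits)


def teacher_pseudo_labels(teacher_logits):
    """Hard labels from the teacher. Assumption: greedy per-frame argmax
    replaces the beam-search decoding a real system would use."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in teacher_logits]


def pseudo_label_loss(labels, student_logits):
    """Mean negative log-likelihood of the teacher's hard labels under the student."""
    total = 0.0
    for label, student_frame in zip(labels, student_logits):
        total += -math.log(softmax(student_frame)[label])
    return total / len(labels)


# Two frames, three output classes; values are made up for illustration.
teacher = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.0]]
student = [[0.0, 1.0, 0.0], [1.0, 0.5, 0.0]]

soft_loss = frame_level_loss(teacher, student)
labels = teacher_pseudo_labels(teacher)
hard_loss = pseudo_label_loss(labels, student)
```

The practical difference in this reading is what the student sees: the full teacher distribution per frame in the first case, and only the teacher's single best output sequence in the second.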
