Training ASR models by Generation of Contextual Information
Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

TL;DR
This paper investigates the use of loosely related contextual information from social media videos to improve speech recognition models trained with limited labeled data, achieving significant WER reductions.
Contribution
It introduces a large-scale evaluation of weakly-supervised learning for ASR using social media data with contextual information, demonstrating notable improvements over supervised baselines.
Findings
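20.8% average WER reduction over a 1000-hour supervised baseline with the best weakly supervised encoder-decoder models
13.4% average WER reduction when using only the weakly supervised encoder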
Improved encoder representations and language generation abilities
Abstract
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000 hours supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder…
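To make the reported figures concrete, here is a minimal Python sketch of how word error rate (WER) and a WER reduction are commonly computed. It is not taken from the paper: the function names, the example sentences, and the 25.0% and 19.8% WER values are hypothetical, and the sketch assumes the reported percentages are relative reductions, which the text above does not state.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new model."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical WER values chosen only to illustrate the arithmetic.
example_reduction = relative_wer_reduction(0.25, 0.198)
```

Under the relative reading, a baseline at 25.0% WER and a new model at 19.8% WER would correspond to a 20.8% reduction, while the same pair would be a 5.2-point absolute reduction.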
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
