Training ASR models by Generation of Contextual Information

Kritika Singh; Dmytro Okhonko; Jun Liu; Yongqiang Wang; Frank Zhang; Ross Girshick; Sergey Edunov; Fuchun Peng; Yatharth Saraf; Geoffrey Zweig; Abdelrahman Mohamed

arXiv:1910.12367·cs.CL·February 18, 2020

TL;DR

This paper investigates whether loosely related contextual information from social media videos (titles and post text) can improve speech recognition models trained with limited labeled data, and reports an average 20.8% WER reduction over a 1000-hour supervised baseline.

Contribution

The paper presents a large-scale evaluation of weakly supervised learning for ASR, training on 50k hours of public English social media videos paired with their titles and post text, and shows average WER reductions of 13.4% to 20.8% over a 1000-hour supervised baseline.

Findings

01 · 20.8% WER reduction with weak supervision

02 · 13.4% WER reduction using only encoder fine-tuning

03 · Improved encoder representations and language generation abilities

Abstract

Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000 hours supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder…

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax