End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

TL;DR
This paper explores semi-supervised learning for end-to-end speech recognition models using pseudo-labeling, demonstrating improved performance across architectures and establishing new state-of-the-art results on LibriSpeech.
Contribution
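It applies semi-supervised training with pseudo-labeling to ResNet, Time-Depth Separable ConvNet, and Transformer acoustic models trained with CTC or Seq2Seq losses, reports new state-of-the-art results on LibriSpeech, and analyzes how the amount of unlabeled data affects the models' reliance on an external language model.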
Findings
Semi-supervised training improves all architectures and loss functions.
Transformer models perform best in the supervised setting; semi-supervision closes much of the gap between them and the other architectures.
More unlabeled data reduces models' dependence on external language models.
Abstract
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways…
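The pseudo-labeling idea named in the abstract can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions, not the paper's pipeline: a nearest-centroid classifier on scalar features stands in for the acoustic model, and the function names (`train`, `predict`, `pseudo_label_round`) are hypothetical. It shows one round of the general recipe: train a model on labeled data, use it to label the unlabeled pool, then retrain on the union.

```python
from statistics import mean


def train(examples):
    """Fit a toy nearest-centroid 'model': maps each label to its mean feature."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def pseudo_label_round(labeled, unlabeled):
    """One pseudo-labeling round (toy stand-in, not the paper's ASR setup)."""
    teacher = train(labeled)                                # 1. supervised model
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]  # 2. label the unlabeled pool
    student = train(labeled + pseudo)                       # 3. retrain on the union
    return student, pseudo


# Hypothetical toy data: scalar features with labels "a" and "b".
labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [0.5, 2.0, 8.0, 9.5]
student, pseudo = pseudo_label_round(labeled, unlabeled)
```

In a speech setting the labels would be transcripts produced by decoding unlabeled audio with the supervised model, which this toy example does not attempt to reproduce.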
Taxonomy
Topics: Speech Recognition and Synthesis · Machine Learning and Algorithms · Anomaly Detection Techniques and Applications
Methods: Sigmoid Activation · Tanh Activation · Average Pooling · Global Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Max Pooling · Kaiming Initialization
